Hi Rahul,

I don't have a direct answer to your question, as I don't know of any
S3-based Directory implementation. Such an implementation would likely
be more complex than an HDFS one.

The reason is that S3 is eventually consistent. When an S3 object is
updated you might still read the old content for a while, and similarly
a "directory listing" in S3 might not immediately show recently added
files. Working around this basically requires storing metadata
elsewhere (see for example EMRFS at
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-fs.html
and S3Guard at
https://blog.cloudera.com/introducing-s3guard-s3-consistency-for-apache-hadoop/).
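To make the consistency problem concrete, here is a minimal sketch (names and structure are purely illustrative, not from EMRFS, S3Guard, or any real implementation): an S3 listing is only trusted once it contains every file recorded in an authoritative metadata store (ZooKeeper, DynamoDB, ...).

```java
import java.util.Set;

// Hypothetical sketch: trust an S3 "directory listing" only once it has
// caught up with an authoritative metadata store's view of the directory.
public class ConsistentListing {

    // Returns true when the (possibly stale) S3 listing contains every
    // file the metadata store says must exist.
    static boolean listingIsCurrent(Set<String> metadataFiles, Set<String> s3Listing) {
        return s3Listing.containsAll(metadataFiles);
    }

    public static void main(String[] args) {
        Set<String> metadata = Set.of("_0.cfs", "_1.cfs", "segments_2");
        // S3 lagging behind: a recently pushed segment is not visible yet.
        Set<String> staleListing = Set.of("_0.cfs", "segments_1");
        Set<String> freshListing = Set.of("_0.cfs", "_1.cfs", "segments_2", "segments_1");

        System.out.println(listingIsCurrent(metadata, staleListing)); // false
        System.out.println(listingIsCurrent(metadata, freshListing)); // true
    }
}
```

A caller that sees a stale listing would retry or fall back rather than act on it.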

At Salesforce we are working on storing SolrCloud search indexes on
S3/GCS in a way that allows these indexes to be shared between
replicas. We use Zookeeper to deal with S3's eventual consistency.
See last year's Activate presentation at
https://www.youtube.com/watch?v=6fE5KvOfb6A, and the code at
https://github.com/apache/lucene-solr/tree/jira/SOLR-13101 (the SHARED
replica type is a good entry point for looking at the changes).

In a nutshell, we use nodes' local disks as a cache that can be lost
when a node fails, and read/write segments from/to S3 as needed.
Queries are then always served from local disk, and indexing always
happens locally, with the resulting segments then pushed to S3.
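The cache behavior described above can be sketched roughly as follows. This is illustrative only, not the actual SOLR-13101 code: plain maps stand in for S3 and the local disk, whereas the real implementation works with Lucene Directory APIs and has to handle checksums, partial failures, and so on.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Sketch of local-disk-as-cache in front of S3 (names are hypothetical).
public class SegmentCache {
    private final Map<String, byte[]> remote = new HashMap<>(); // stands in for S3
    private final Map<String, byte[]> local = new HashMap<>();  // stands in for local disk

    // Queries read from local disk; a missing segment is pulled from S3 first.
    byte[] readSegment(String name) {
        return local.computeIfAbsent(name, remote::get);
    }

    // Indexing writes locally, then the segment is pushed to S3 so other
    // replicas (or a replacement node) can pull it later.
    void writeSegment(String name, byte[] data) {
        local.put(name, data);
        remote.put(name, data);
    }

    // Simulates losing a node: the local cache is gone, S3 survives.
    void loseNode() {
        local.clear();
    }

    public static void main(String[] args) {
        SegmentCache cache = new SegmentCache();
        cache.writeSegment("_0.cfs", "segment bytes".getBytes(StandardCharsets.UTF_8));
        cache.loseNode();
        // After the node "fails", the segment is re-fetched from S3 on read.
        System.out.println(new String(cache.readSegment("_0.cfs"), StandardCharsets.UTF_8));
    }
}
```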

Could a solution to your use case be built instead on S3-based
backup/restore? It would require the right data partitioning (to get
reasonable restore-then-query latency for cold data and reasonable
backup time for modified indexes) and likely a friendly indexing
pipeline that can resubmit data indexed since the last backup...
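For reference, BACKUP and RESTORE are standard Collections API actions; something along these lines is what I have in mind (collection names, backup name, and location are placeholders, and writing directly to an S3 location requires a backup repository that can talk to S3, e.g. the HDFS repository configured with s3a):

```shell
# Back up a cold collection to S3 (placeholder names throughout).
curl "http://localhost:8983/solr/admin/collections?action=BACKUP&name=cold-2020-04&collection=logs&location=s3a://bucket/solr-backups"

# Later, when the cold data needs to be queried again, restore it.
curl "http://localhost:8983/solr/admin/collections?action=RESTORE&name=cold-2020-04&collection=logs-restored&location=s3a://bucket/solr-backups"
```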

Ilan

On Fri, Apr 24, 2020 at 5:06 AM dhurandar S <[email protected]> wrote:
>
> Hi Jan,
>
> Thank you for your reply. The reason we are looking at S3 is that the
> volume is close to 10 petabytes.
> We are okay with higher latency, say twice or thrice that of placing
> data on local disk, but we have a requirement to keep long-range data and
> provide search capability on it. Every other storage apart from S3 turned
> out to be very expensive at that scale.
>
> Basically I want to replace
>
> -Dsolr.directoryFactory=HdfsDirectoryFactory \
>
>  with S3 based implementation.
>
>
> regards,
> Rahul
>
>
>
>
>
> On Thu, Apr 23, 2020 at 3:12 AM Jan Høydahl <[email protected]> wrote:
>>
>> Hi,
>>
>> Is your data so partitioned that it makes sense to consider splitting up
>> in multiple collections and make some arrangement that will keep only
>> a few collections live at a time, loading index files from S3 on demand?
>>
>> I cannot see how an S3 directory would be able to effectively cache files
>> from S3, or what units the index files would be stored as.
>>
>> Have you investigated EFS as an alternative? That would look like a
>> normal filesystem to Solr but might be cheaper storage wise, but much
>> slower.
>>
>> Jan
>>
>> > 23. apr. 2020 kl. 06:57 skrev dhurandar S <[email protected]>:
>> >
>> > Hi,
>> >
>> > I am looking to use S3 as the place to store indexes, just as Solr uses
>> > HdfsDirectory to store the index and all the other documents.
>> >
>> > We want to provide a search capability that is okay to be a little slow but
>> > cheaper in terms of the cost. We have close to 2 petabytes of data on which
>> > we want to provide the Search using Solr.
>> >
>> > Are there any open-source implementations around using S3 as the Directory
>> > for Solr ??
>> >
>> > Any recommendations on this approach?
>> >
>> > regards,
>> > Rahul
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
