Hi Rahul,

I don't have a direct answer to your question, as I don't know of any S3-based Directory implementation. Such an implementation would likely be more complex than an HDFS one.
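For context, the Directory implementation is selected via the directoryFactory element in solrconfig.xml; that is the spot where an S3-backed factory would plug in if one existed. A sketch using the existing HDFS factory (host and path are placeholders):

```xml
<!-- solrconfig.xml: selecting a Directory implementation.
     HdfsDirectoryFactory is the real HDFS-backed factory; a hypothetical
     S3 factory would be wired in the same way. Paths are placeholders. -->
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
  <!-- local block cache that masks some of the remote-storage latency -->
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
</directoryFactory>
```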
The reason is that S3 is eventually consistent. When an S3 object is updated you might still read the old content for a while, and similarly a "directory listing" in S3 might not immediately show recently added files. Working around this basically requires storing metadata elsewhere (see for example https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-fs.html and https://blog.cloudera.com/introducing-s3guard-s3-consistency-for-apache-hadoop/).

At Salesforce we are working on storing SolrCloud search indexes on S3/GCS in a way that allows those indexes to be shared between replicas. We use ZooKeeper to deal with S3's eventual consistency. See last year's Activate presentation https://www.youtube.com/watch?v=6fE5KvOfb6A, and the code at https://github.com/apache/lucene-solr/tree/jira/SOLR-13101 (the SHARED replica type is a good entry point for looking at the changes). In a nutshell, we use the nodes' local disks as a cache that can be lost when a node fails, and read/write segments from/to S3 as needed. Queries are always served from local disk, and indexing always happens locally and is then pushed to S3.

Could a solution to your use case be built instead on S3-based backup/restore? It would require the right data partitioning (to get reasonable restore-then-query latency for cold data and reasonable backup time for modified indexes), and likely a friendly indexing pipeline that can resubmit data indexed since the last backup...

Ilan

On Fri, Apr 24, 2020 at 5:06 AM dhurandar S <[email protected]> wrote:
>
> Hi Jan,
>
> Thank you for your reply. The reason we are looking at S3 is that the
> volume is close to 10 petabytes.
> We are okay with a latency of, say, twice or three times that of placing
> the data on local disk, but we have a requirement to keep long-range data
> and provide search capability on it. Every other storage option apart from
> S3 turned out to be very expensive at that scale.
>
> Basically I want to replace
>
> -Dsolr.directoryFactory=HdfsDirectoryFactory \
>
> with an S3-based implementation.
>
> regards,
> Rahul
>
> On Thu, Apr 23, 2020 at 3:12 AM Jan Høydahl <[email protected]> wrote:
>>
>> Hi,
>>
>> Is your data partitioned in a way that makes it sensible to split it
>> across multiple collections, with some arrangement that keeps only a few
>> collections live at a time, loading index files from S3 on demand?
>>
>> I cannot see how an S3 directory would be able to effectively cache files
>> in S3, and in what units the index files would be stored.
>>
>> Have you investigated EFS as an alternative? That would look like a
>> normal filesystem to Solr and might be cheaper storage-wise, but much
>> slower.
>>
>> Jan
>>
>> > 23. apr. 2020 kl. 06:57 skrev dhurandar S <[email protected]>:
>> >
>> > Hi,
>> >
>> > I am looking to use S3 as the place to store indexes, just as Solr uses
>> > HdfsDirectory to store the index and all the other files.
>> >
>> > We want to provide a search capability that is okay to be a little slow
>> > but cheaper in terms of cost. We have close to 2 petabytes of data on
>> > which we want to provide search using Solr.
>> >
>> > Are there any open-source implementations using S3 as the Directory
>> > for Solr?
>> >
>> > Any recommendations on this approach?
>> >
>> > regards,
>> > Rahul

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
