I have written a Spark–Lucene integration as part of the Verizon trapezium/dal project. You can extract the data stored in HDFS indices and feed it to Spark:
https://github.com/Verizon/trapezium/tree/master/dal/src/test/scala/com/verizon/bda/trapezium/dal

I intend to publish it as a Spark package as soon as I get time. You can use spark-solr or spark-elastic, but I did not want to pull in the Solr/Elasticsearch dependency, for performance reasons.

Thanks.
Deb

On Mar 13, 2018 4:31 PM, "Tom Hirschfeld" <tomhirschf...@gmail.com> wrote:

Hello!

*Background*: My team is running a machine learning pipeline, and part of the pipeline is a scrape of a web-based Lucene application via HTTP calls. The scrape outputs a CSV file that we then upload to HDFS and use as input to a Spark ML job.

*Question*: Is there a way for our Spark application to read from a Lucene index stored in HDFS? Specifically, I see here <http://lucene.apache.org/solr/6_5_0/solr-core/org/apache/solr/store/hdfs/HdfsDirectory.html> that solr-core has an HDFS directory type that seems to be compatible with the Lucene IndexReader. Is this compatible? Are we able to store our index in HDFS and read from it in a Spark job?

Best,
Tom Hirschfeld
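For what it's worth, a minimal driver-side sketch of the approach Tom is asking about might look like the following. It assumes solr-core (for `HdfsDirectory`), lucene-core, and the Hadoop client are on the classpath; the index path `/indices/myindex` and the stored field name `body` are hypothetical, and a real job would open the index per-partition on the executors rather than collecting everything on the driver:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.lucene.index.DirectoryReader
import org.apache.solr.store.hdfs.HdfsDirectory
import org.apache.spark.sql.SparkSession

object LuceneHdfsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("lucene-hdfs-read").getOrCreate()

    // Solr's HdfsDirectory wraps an HDFS path as a Lucene Directory,
    // so the standard DirectoryReader can open it.
    val conf   = new Configuration()
    val dir    = new HdfsDirectory(new Path("/indices/myindex"), conf) // hypothetical path
    val reader = DirectoryReader.open(dir)

    // Pull one stored field from every doc id. This ignores deleted docs
    // and reads everything on the driver -- fine only for small indexes.
    val texts = (0 until reader.maxDoc()).map { i =>
      reader.document(i).get("body") // "body" is a hypothetical stored field
    }

    val rdd = spark.sparkContext.parallelize(texts)
    println(s"read ${rdd.count()} documents from the Lucene index")

    reader.close()
    dir.close()
    spark.stop()
  }
}
```

To scale past toy indexes you would instead ship the index path to each executor (e.g. one Lucene segment or shard per Spark partition) and open a reader inside `mapPartitions`, which is roughly what the trapezium/dal code linked above does.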