[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated SOLR-1301:
------------------------------

    Attachment: SOLR-1301.patch

This is likely the last patch I'll put up for a bit - I'm on vacation from 
Wed-Mon.

Patch Notes:

ant precommit passes again. I've fixed the forbidden API calls and a couple of 
minor javadoc issues in the new morphlines code. I also fixed a more 
problematic javadocs issue: broken links from the morphlines code to the 
extraction code, caused by the morphlines code extending extraction classes.

I've added tika-xmp to the extraction dependencies.

I don't like that tests can pass when some necessary run-time jars are missing 
- we will likely need to look into adding simple tests that exercise each 
necessary jar, or even just hack tests that try to load a class from each of 
the offending jars (see the sketch below). I'll save that for a follow-up 
issue though - at least the solr cell morphlines tests considerably increased 
the number of dependencies the tests hit.
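
A minimal sketch of such a hack test, assuming we simply assert that a 
representative class from each required run-time jar is loadable; the class 
and test names here are illustrative, not a final list:

    import org.junit.Test;

    public class RequiredRuntimeJarsTest {

      @Test
      public void testTikaXmpJarOnClasspath() throws Exception {
        // Fails with ClassNotFoundException if the tika-xmp jar is missing
        // from the run-time classpath.
        Class.forName("org.apache.tika.xmp.XMPMetadata");
      }
    }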

There is also a test speed issue that is not on the critical path - on my fast 
machine, which runs 8 tests in parallel, this adds about 4-5 minutes to the 
test run. It would be good to trim some of the longer tests for standard runs 
and keep them as-is for @nightly runs (a rough sketch follows). That can wait 
until post commit though.
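
A rough sketch of that split, assuming the test framework's @Nightly 
annotation is used to keep the full-size variant out of standard runs; the 
test and helper names are hypothetical:

    import com.carrotsearch.randomizedtesting.annotations.Nightly;
    import org.apache.solr.SolrTestCaseJ4;

    public class MorphlinesIngestTest extends SolrTestCaseJ4 {

      // Full-size run: only executed when -Dtests.nightly=true.
      @Nightly
      public void testLargeIngest() throws Exception {
        runIngest(100000);
      }

      // Trimmed-down run for standard test runs.
      public void testSmallIngest() throws Exception {
        runIngest(100);
      }

      // Hypothetical helper that indexes docCount documents through the
      // morphlines pipeline and verifies the results.
      private void runIngest(int docCount) throws Exception {
        // ...
      }
    }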

That leaves the following 2 critical path items to deal with:

* Get the tests to run without a hacked test.policy file.
* Dist packaging. This includes things like creating the final 
MapReduceIndexerTool jar file and dealing with its dependencies, as well as 
the location of the morphlines code and how it is distributed.

Other than that we are looking pretty good - all tests and precommit are 
passing.


                
> Add a Solr contrib that allows for building Solr indexes via Hadoop's 
> Map-Reduce.
> ---------------------------------------------------------------------------------
>
>                 Key: SOLR-1301
>                 URL: https://issues.apache.org/jira/browse/SOLR-1301
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Andrzej Bialecki 
>            Assignee: Mark Miller
>             Fix For: 4.5, 5.0
>
>         Attachments: commons-logging-1.0.4.jar, 
> commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
> hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
> log4j-1.2.15.jar, README.txt, SOLR-1301-hadoop-0-20.patch, 
> SOLR-1301-hadoop-0-20.patch, SOLR-1301.patch, SOLR-1301.patch, 
> SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
> SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
> SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
> SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java
>
>
> This patch contains a contrib module that provides distributed indexing 
> (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
> twofold:
> * provide an API that is familiar to Hadoop developers, i.e. that of 
> OutputFormat
> * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
> SolrOutputFormat consumes data produced by reduce tasks directly, without 
> storing it in intermediate files. Furthermore, by using an 
> EmbeddedSolrServer, the indexing task is split into as many parts as there 
> are reducers, and the data to be indexed is not sent over the network.
> Design
> ----------
> Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
> which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
> instantiates an EmbeddedSolrServer, and it also instantiates an 
> implementation of SolrDocumentConverter, which is responsible for turning 
> Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
> batch, which is periodically submitted to EmbeddedSolrServer. When the reduce 
> task completes and the OutputFormat is closed, SolrRecordWriter calls 
> commit() and optimize() on the EmbeddedSolrServer.
> The API provides facilities to specify an arbitrary existing solr.home 
> directory, from which the conf/ and lib/ files will be taken.
> This process results in the creation of as many partial Solr home directories 
> as there were reduce tasks. The output shards are placed in the output 
> directory on the default filesystem (e.g. HDFS). Such part-NNNNN directories 
> can be used to run N shard servers. Additionally, users can specify the 
> number of reduce tasks, in particular 1 reduce task, in which case the output 
> will consist of a single shard.
> An example application is provided that processes large CSV files and uses 
> this API. It uses custom CSV processing to avoid (de)serialization overhead.
> This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
> issue; you should put it in contrib/hadoop/lib.
> Note: the development of this patch was sponsored by an anonymous contributor 
> and approved for release under Apache License.
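
A rough sketch of how the design described above might look from the user 
side, assuming a SolrDocumentConverter with a convert(key, value) method that 
returns SolrInputDocuments; class names come from the description and the 
attached patch, but the exact signatures and wiring may differ:

    import java.util.Collection;
    import java.util.Collections;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.solr.common.SolrInputDocument;

    // Turns one Hadoop (key, value) pair produced by the reducer into a
    // SolrInputDocument, as described in the design notes above.
    public class CsvDocumentConverter extends SolrDocumentConverter<LongWritable, Text> {
      @Override
      public Collection<SolrInputDocument> convert(LongWritable key, Text value) {
        String[] fields = value.toString().split(",");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", fields[0]);
        doc.addField("text", fields[1]);
        return Collections.singletonList(doc);
      }
    }

    // Hypothetical job wiring (kept as comments since the exact helper
    // methods are defined by the patch): point the OutputFormat at an
    // existing solr.home whose conf/ and lib/ are shipped to each reducer,
    // register the converter, and pick the number of reducers, which equals
    // the number of output shards.
    //
    //   JobConf job = new JobConf(CsvIndexer.class);
    //   job.setOutputFormat(SolrOutputFormat.class);
    //   SolrOutputFormat.setupSolrHomeCache(new File("/path/to/solr/home"), job);
    //   SolrDocumentConverter.setSolrDocumentConverter(CsvDocumentConverter.class, job);
    //   job.setNumReduceTasks(4);  // 4 reducers -> 4 output shards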

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
