[
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13214152#comment-13214152
]
Alexander Kanarsky commented on SOLR-1301:
------------------------------------------
OK, so I changed the patch to work with the 3.5 ant build and re-tested it with
Solr 3.5 and Cloudera's CDH3u3 (both the build and the CSV test were run in
pseudo-distributed mode). Still no unit tests, but I am working on this :)
There are no changes compared to the previous version, except that I had to
comment out the code that sets the debug level dynamically in SolrRecordWriter
because of conflicts with the slf4j parts in current Solr; I think this is
minor, but if not, please feel free to resolve it and update the patch. With
this done, there is no longer any need to put the log4j and commons-logging
jars in hadoop/lib at compile time; only the hadoop jar is required. I provided
the hadoop-core-0.20.2-cdh3u3.jar used for testing as part of the patch, but
you can use other 0.20.x versions if you'd like; it should also work with
Hadoop 0.21.x. Note that you still need to make the other related jars (solr,
solrj, lucene, commons, etc.) available while running your job; one way to do
this is to put all the needed jars into the lib subfolder of the
apache-solr-hadoop jar. Other approaches are described here:
http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
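For reference, here is a rough sketch of the lib-subfolder approach; the jar
names and paths below are illustrative and depend on where your Solr and Lucene
jars end up after the build:

  # unpack the job jar, drop the runtime dependencies under lib/, and repack;
  # Hadoop adds lib/*.jar found inside the job jar to the task classpath
  mkdir job-jar && cd job-jar
  jar -xf ../dist/apache-solr-hadoop-3.5-SNAPSHOT.jar
  mkdir -p lib
  cp ../dist/apache-solr-core-*.jar ../dist/apache-solr-solrj-*.jar lib/
  cp ../lib/*.jar lib/                # lucene, commons, slf4j, etc.
  jar -cf ../dist/solr-hadoop-job.jar .
  cd ..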
Finally, the quick steps to get the patch compiled (on Linux):
1. get the solr source tarball (apache-solr-3.5.0-src.tgz in this example),
put it into some folder, cd there
2. tar -xzf apache-solr-3.5.0-src.tgz
3. cd apache-solr-3.5.0/solr
4. wget
https://issues.apache.org/jira/secure/attachment/12515662/SOLR-1301.patch
5. patch -p0 -i SOLR-1301.patch
6. mkdir contrib/hadoop/lib
7. cd contrib/hadoop/lib
8. wget
https://issues.apache.org/jira/secure/attachment/12515663/hadoop-core-0.20.2-cdh3u3.jar
9. cd ../../..
10. ant dist
and you should have apache-solr-hadoop-3.5-SNAPSHOT.jar in the solr/dist folder.
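Once the jar is built and the dependencies are made available as described
above, running the CSV example against a pseudo-distributed cluster looks
roughly like the following; the driver class name and the argument order here
are my assumptions, so check the example sources in the patch for the exact
usage:

  # hypothetical invocation; class name and arguments are illustrative
  hadoop jar dist/solr-hadoop-job.jar org.apache.solr.hadoop.CSVIndexer \
      /local/path/to/solr/home /user/me/csv-output /user/me/input/data.csv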
> Solr + Hadoop
> -------------
>
> Key: SOLR-1301
> URL: https://issues.apache.org/jira/browse/SOLR-1301
> Project: Solr
> Issue Type: Improvement
> Affects Versions: 1.4
> Reporter: Andrzej Bialecki
> Fix For: 3.6, 4.0
>
> Attachments: README.txt, SOLR-1301-hadoop-0-20.patch,
> SOLR-1301-hadoop-0-20.patch, SOLR-1301.patch, SOLR-1301.patch,
> SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch,
> SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch,
> SolrRecordWriter.java, commons-logging-1.0.4.jar,
> commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar,
> hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch,
> log4j-1.2.15.jar
>
>
> This patch contains a contrib module that provides distributed indexing
> (using Hadoop) to Solr's EmbeddedSolrServer. The idea behind this module is
> twofold:
> * provide an API that is familiar to Hadoop developers, i.e. that of
> OutputFormat
> * avoid unnecessary export and (de)serialization of data maintained on HDFS.
> SolrOutputFormat consumes data produced by reduce tasks directly, without
> storing it in intermediate files. Furthermore, by using an
> EmbeddedSolrServer, the indexing task is split into as many parts as there
> are reducers, and the data to be indexed is not sent over the network.
> Design
> ----------
> Key/value pairs produced by reduce tasks are passed to SolrOutputFormat,
> which in turn uses SolrRecordWriter to write this data. SolrRecordWriter
> instantiates an EmbeddedSolrServer, and it also instantiates an
> implementation of SolrDocumentConverter, which is responsible for turning
> Hadoop (key, value) into a SolrInputDocument. This data is then added to a
> batch, which is periodically submitted to EmbeddedSolrServer. When a reduce
> task completes and the OutputFormat is closed, SolrRecordWriter calls
> commit() and optimize() on the EmbeddedSolrServer.
> The API provides facilities to specify an arbitrary existing solr.home
> directory, from which the conf/ and lib/ files will be taken.
> This process results in the creation of as many partial Solr home directories
> as there were reduce tasks. The output shards are placed in the output
> directory on the default filesystem (e.g. HDFS). Such part-NNNNN directories
> can be used to run N shard servers. Additionally, users can specify the
> number of reduce tasks, in particular 1 reduce task, in which case the output
> will consist of a single shard.
> An example application is provided that processes large CSV files and uses
> this API. It uses custom CSV processing to avoid (de)serialization overhead.
> This patch relies on hadoop-core-0.19.1.jar; I attached the jar to this
> issue. You should put it in contrib/hadoop/lib.
> Note: the development of this patch was sponsored by an anonymous contributor
> and approved for release under Apache License.
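To make the design above a bit more concrete, here is a minimal, untested
sketch of a converter and a driver. The signatures of SolrDocumentConverter and
SolrOutputFormat, their generic parameters, and the way the converter class is
registered on the job are assumptions on my part, so check the classes in the
patch for the real API:

  import java.util.Collection;
  import java.util.Collections;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.solr.common.SolrInputDocument;

  // Turns one (key, value) pair produced by a reducer into Solr documents.
  // Assumes SolrDocumentConverter is parameterized by the Hadoop key/value
  // types and exposes a single convert() method, as described above.
  public class LineConverter extends SolrDocumentConverter<LongWritable, Text> {
    public Collection<SolrInputDocument> convert(LongWritable key, Text value) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", String.valueOf(key.get()));
      doc.addField("text", value.toString());
      return Collections.singletonList(doc);
    }
  }

  // Driver-side wiring, assuming the old mapred API of Hadoop 0.20.x. The
  // default identity mapper/reducer simply pass the (offset, line) pairs
  // from TextInputFormat through to SolrOutputFormat.
  public class SolrIndexJob {
    public static void main(String[] args) throws Exception {
      JobConf job = new JobConf(SolrIndexJob.class);
      job.setOutputFormat(SolrOutputFormat.class); // write via EmbeddedSolrServer
      job.setNumReduceTasks(1);                    // one reducer => a single output shard
      // SolrDocumentConverter.setSolrDocumentConverter(LineConverter.class, job);
      // (hypothetical helper; the patch's actual registration mechanism may
      //  differ, and the solr.home directory also has to be passed to the job)
      FileInputFormat.setInputPaths(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      JobClient.runJob(job);
    }
  }

The sketch just mirrors the flow in the description: SolrOutputFormat hands
each reduced pair to SolrRecordWriter, which runs it through the converter and
batches the resulting documents into an embedded Solr core; with one reducer
the whole index ends up in a single shard directory under the output path.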