[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13595806#comment-13595806 ]

Hudson commented on NUTCH-1047:

Integrated in Nutch-trunk-Windows #57 (See [https://builds.apache.org/job/Nutch-trunk-Windows/57/])
NUTCH-1047 Pluggable indexing backends (Revision 1453776)

Result = FAILURE
jnioche : http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1453776
Files :
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/build.xml
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/default.properties
* /nutch/trunk/ivy/ivy.xml
* /nutch/trunk/src/bin/nutch
* /nutch/trunk/src/java/org/apache/nutch/crawl/Crawl.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/CleaningJob.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexWriter.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexWriters.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingJob.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/NutchIndexWriter.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/NutchIndexWriterFactory.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrClean.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrMappingReader.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrUtils.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrWriter.java
* /nutch/trunk/src/plugin/build.xml
* /nutch/trunk/src/plugin/indexer-solr
* /nutch/trunk/src/plugin/indexer-solr/build.xml
* /nutch/trunk/src/plugin/indexer-solr/ivy.xml
* /nutch/trunk/src/plugin/indexer-solr/plugin.xml
* /nutch/trunk/src/plugin/indexer-solr/src
* /nutch/trunk/src/plugin/indexer-solr/src/java
* /nutch/trunk/src/plugin/indexer-solr/src/java/org
* /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache
* /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch
* /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter
* /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr
* /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrConstants.java
* /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java
* /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrMappingReader.java
* /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrUtils.java
* /nutch/trunk/src/plugin/nutch-extensionpoints/plugin.xml

Pluggable indexing backends
---------------------------

Key: NUTCH-1047
URL: https://issues.apache.org/jira/browse/NUTCH-1047
Project: Nutch
Issue Type: New Feature
Components: indexer
Reporter: Julien Nioche
Assignee: Julien Nioche
Labels: indexing
Fix For: 1.7
Attachments: NUTCH-1047-1.x-final.patch, NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch, NUTCH-1047-1.x-v3.patch, NUTCH-1047-1.x-v4.patch, NUTCH-1047-1.x-v5.patch

One possible feature would be to add a new endpoint for indexing-backends and make the indexing pluggable. At the moment we are hardwired to SOLR, which is OK, but as other resources like ElasticSearch are becoming more popular it would be better to handle this as plugins.

Not sure about the name of the endpoint though: we already have indexing-plugins (which are about generating fields sent to the backends), and moreover the backends are not necessarily for indexing / searching but could be just an external storage, e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this could pertain to the storage in GORA. 'indexing-backend' is the best name that came to my mind so far; please suggest better ones.

We should come up with generic map/reduce jobs for indexing, deduplicating and cleaning, and maybe add a Nutch extension point there so we can easily hook up indexing, cleaning and deduplicating for various backends.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
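The idea in the issue description, generic jobs that hand documents to whatever backends are plugged in, can be sketched roughly as follows. This is only an illustration under stated assumptions: the names `SketchIndexBackend`, `InMemoryBackend` and `SketchBackends` are hypothetical and are not the interfaces committed to Nutch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical backend contract for illustration; not the actual Nutch interface. */
interface SketchIndexBackend {
    void open(Map<String, String> conf);
    void write(String docId, Map<String, String> fields);
    void delete(String docId);
    void commit();
    String describe();
}

/** Toy backend that keeps documents in memory, standing in for Solr, ElasticSearch, CouchDB... */
class InMemoryBackend implements SketchIndexBackend {
    final Map<String, Map<String, String>> docs = new LinkedHashMap<>();
    int commits = 0;

    public void open(Map<String, String> conf) { /* nothing to configure in this toy backend */ }
    public void write(String docId, Map<String, String> fields) { docs.put(docId, fields); }
    public void delete(String docId) { docs.remove(docId); }
    public void commit() { commits++; }
    public String describe() { return "InMemoryBackend: stores documents in a map"; }
}

/** Fans every operation out to all registered backends, so the calling job stays backend-neutral. */
class SketchBackends {
    private final List<SketchIndexBackend> backends = new ArrayList<>();

    void register(SketchIndexBackend backend) { backends.add(backend); }

    void write(String docId, Map<String, String> fields) {
        for (SketchIndexBackend b : backends) b.write(docId, fields);
    }

    void delete(String docId) {
        for (SketchIndexBackend b : backends) b.delete(docId);
    }

    void commit() {
        for (SketchIndexBackend b : backends) b.commit();
    }
}
```

An indexing or cleaning job written against `SketchBackends` never mentions a specific backend, which is the property the issue asks for.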
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13595809#comment-13595809 ]

Hudson commented on NUTCH-1047:

Integrated in Nutch-trunk #2144 (See [https://builds.apache.org/job/Nutch-trunk/2144/])
NUTCH-1047 Pluggable indexing backends (Revision 1453776)

Result = SUCCESS
jnioche : http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1453776
Files :
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/build.xml
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/default.properties
* /nutch/trunk/ivy/ivy.xml
* /nutch/trunk/src/bin/nutch
* /nutch/trunk/src/java/org/apache/nutch/crawl/Crawl.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/CleaningJob.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexWriter.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexWriters.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingJob.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/NutchIndexWriter.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/NutchIndexWriterFactory.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrClean.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrMappingReader.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrUtils.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrWriter.java
* /nutch/trunk/src/plugin/build.xml
* /nutch/trunk/src/plugin/indexer-solr
* /nutch/trunk/src/plugin/indexer-solr/build.xml
* /nutch/trunk/src/plugin/indexer-solr/ivy.xml
* /nutch/trunk/src/plugin/indexer-solr/plugin.xml
* /nutch/trunk/src/plugin/indexer-solr/src
* /nutch/trunk/src/plugin/indexer-solr/src/java
* /nutch/trunk/src/plugin/indexer-solr/src/java/org
* /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache
* /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch
* /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter
* /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr
* /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrConstants.java
* /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java
* /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrMappingReader.java
* /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrUtils.java
* /nutch/trunk/src/plugin/nutch-extensionpoints/plugin.xml
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13596042#comment-13596042 ]

Lewis John McGibbney commented on NUTCH-1047:

Nice work Julien.
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13595252#comment-13595252 ]

Sebastian Nagel commented on NUTCH-1047:

Hi Julien,

overall, all looks good. A first version of the CSV indexer is ready (NUTCH-1541) and works well with the latest v5 patch.

One point we should improve is the command-line help. I agree with Tejas that the help should list all required arguments. Of course, you are right that the index/cleaning jobs are backend-neutral, but then it would be preferable to have new commands index and indexclean. They are also required if other indexer back-ends are used. We can keep the solr* commands for legacy reasons and because they are handy. A few additional lines to generate the prior help text are tolerable and could avoid unnecessary user requests on the mailing list.

The describe() method is a good idea. The new commands will then show sufficient help, but IndexingJob/CleaningJob should also call describe() when help is shown!

Some trivialities to get the Java docs right:
* default.properties - need to add the new plugins.indexer group with indexer-solr as member
* build.xml - add group referring to plugins.indexer, add Java doc targets for indexer-solr
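The suggestion that the jobs call describe() whenever help is shown could look roughly like the sketch below. `DescribableWriter` and `UsageSketch` are hypothetical names chosen for this illustration, and the exact wording of the help text is an assumption; only the idea of appending each backend's own description to the generic usage line comes from the comment above.

```java
import java.util.List;

/** Hypothetical stand-in for an indexing backend that can describe its own settings. */
interface DescribableWriter {
    String describe();
}

class UsageSketch {
    /** Builds the help text: the generic usage line, then one description per active backend. */
    static String usage(String genericUsage, List<DescribableWriter> writers) {
        StringBuilder sb = new StringBuilder(genericUsage).append('\n');
        if (writers.isEmpty()) {
            sb.append("No indexing backends are active\n");
        } else {
            sb.append("Active indexing backends:\n");
            for (DescribableWriter w : writers) {
                // Each backend documents the configuration it needs, e.g. a server URL.
                sb.append("  ").append(w.describe()).append('\n');
            }
        }
        return sb.toString();
    }
}
```

With this shape the generic job never needs to know that a Solr backend wants a URL; the backend says so itself in the help output.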
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13583011#comment-13583011 ]

Tejas Patil commented on NUTCH-1047:

Hi Julien,

One small change in the Java class would be to display this usage message to the user:
{noformat}
$ bin/nutch solrclean
Usage: CleaningJob crawldb solrurl [-noCommit]
{noformat}
The current patch doesn't display solrurl in the usage.
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13583111#comment-13583111 ]

Julien Nioche commented on NUTCH-1047:

Tejas,

The CleaningJob is backend-neutral and as such should not expect solrurl as a parameter. The same goes for the IndexingJob.
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13582034#comment-13582034 ]

Julien Nioche commented on NUTCH-1047:

Hi Tejas

Thank you for taking the time to have a look. The SolrClean command has been modified too to use the plugin architecture, and that should be the last thing, I think.

Thanks

Julien
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13582163#comment-13582163 ]

Tejas Patil commented on NUTCH-1047:

Hey Julien,

While running the solrclean command, I followed the old usage given here [0]. It gave an exception. Then I looked at the usage and it gave
{noformat}
$ bin/nutch solrclean
Usage: CleaningJob crawldb [-noCommit]
{noformat}
That did not work either. It just prints the usage if only the crawldb is passed as an argument. I went through the patch and realized that the bin/nutch script treats the first argument as the solr url, and then the leftover, i.e. the crawldb, is passed to the Java code. This is what worked for me:
{noformat}
bin/nutch solrclean solrurl crawldb
{noformat}
This is different from the old usage given at [0]. We can avoid changing the order of the arguments and preserve the old usage. This can be used in the bin/nutch script:
{noformat}
CLASS=org.apache.nutch.indexer.CleaningJob -D solr.server.url=$2
{noformat}
and not perform a shift after that. The corresponding usage must be modified in the Java code too.

[0] : http://wiki.apache.org/nutch/bin/nutch%20solrclean
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13582183#comment-13582183 ]

Julien Nioche commented on NUTCH-1047:

Hi Tejas

Good catch, could do
{color:red}
CLASS=org.apache.nutch.indexer.CleaningJob -D solr.server.url=$2 $1
shift; shift
{color}
There is no change to make in the Java code as it expects only one argument, which is the crawldb. Could also get the CleaningJob to log which indexers are available.

re-solrdedup : the explanation is given earlier in this thread. It is a SOLR-specific approach and we can't run a job located in a plugin. The main job file has to be in the core code. We need a better deduplicator anyway.
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13581910#comment-13581910 ]

lufeng commented on NUTCH-1047:

Patch v5 works correctly in Nutch 1.6 with Solr 3.6 and 4.1. For Solr 4.1, the configuration file schema-solr4.xml was used with the patch from [NUTCH-1486|https://issues.apache.org/jira/browse/NUTCH-1486] applied. It would be better if the index job could report progress. Good job, thanks Julien.
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13581964#comment-13581964 ]

Tejas Patil commented on NUTCH-1047:

Hi Julien,

The crawl command (with the solr option) and the solrindex command are working properly now :)
Is there anything else that you think must be verified?
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13565202#comment-13565202 ]

Tejas Patil commented on NUTCH-1047:

Hi Julien,

As you suggested, I tried to run the solrindex command without setting solr.server.url in nutch-site.xml or via -D.

Command used:
{noformat}
bin/nutch solrindex http://localhost:8983/solr mycrawl/crawldb/ mycrawl/segments/201301280439/
{noformat}
It says:
{noformat}
Usage: Indexer crawldb [-linkdb linkdb] [-params k1=v1&k2=v2...] (segment ... | -dir segments) [-noCommit] [-deleteGone] [-filter] [-normalize]
{noformat}
The check for the number of args is causing this. I corrected it locally and it worked fine after that.

As per the usage above, the user needs to provide just the crawldb and a segment. But the user also needs to pass the solrurl, which is consumed by the bin/nutch script. The usage message must be changed to hide this mechanism from the user.
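To make the argument-count problem concrete, here is a minimal sketch of the kind of check the usage line implies: a crawldb followed by at least one segment (or a -dir option). `IndexerArgsSketch` is a hypothetical class written for this illustration, not the code in the patch, and it only records the -dir value instead of listing the segments inside that directory.

```java
import java.util.ArrayList;
import java.util.List;

class IndexerArgsSketch {
    static final String USAGE =
        "Usage: Indexer crawldb [-linkdb linkdb] [-params k1=v1&k2=v2...] "
        + "(segment ... | -dir segments) [-noCommit] [-deleteGone] [-filter] [-normalize]";

    /**
     * Returns the segment arguments, or throws with the usage text when
     * the crawldb or the segments are missing.
     */
    static List<String> segments(String[] args) {
        if (args.length < 2) throw new IllegalArgumentException(USAGE);
        List<String> segments = new ArrayList<>();
        for (int i = 1; i < args.length; i++) {      // args[0] is the crawldb
            String a = args[i];
            if (a.equals("-linkdb") || a.equals("-params")) {
                i++;                                  // skip the option's value
            } else if (a.equals("-dir")) {
                if (++i >= args.length) throw new IllegalArgumentException(USAGE);
                segments.add(args[i]);                // sketch: keep the directory itself
            } else if (!a.startsWith("-")) {
                segments.add(a);                      // a plain segment path
            }
            // remaining flags (-noCommit, -deleteGone, ...) take no value
        }
        if (segments.isEmpty()) throw new IllegalArgumentException(USAGE);
        return segments;
    }
}
```

Note that the Solr URL does not appear here at all: in the setup described above it is consumed by the wrapper script before the Java code sees the arguments.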
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13564173#comment-13564173 ] Julien Nioche commented on NUTCH-1047: -- @tejasp can reproduce the issue and am looking into it, thanks. Somehow the configuration does not get passed on properly when using the crawl command. Thanks. Lufeng {quote} But i don't know why not add an option to set IndexerUrl such as bin/nutch solrindex -indexurl http://localhost:8983/solr/. {quote} whether it is passed as a parameter or via configuration should not make much of a difference. Your suggestion also assumes that the indexing backend can be reached via a single URL which is not necessarily the case as it could not need a URL at all or at the opposite need multiple URLs. Better to leave that logic in the configuration and assume that the backends will find whatever they need there. {quote} the corrent command to invoke the IndexingJob command is bin/nutch solrindex http://localhost:8983/solr/ crawldb/ segments/20130121115214/ -filter. {quote} as explained above we want to keep compatibility with the existing sorlindex command and not change its syntax. Underneath it uses the new code based on plugins but sets the value of the solr config. There is no shortcut for the generic indexing job command in the nutch script yet but we could add one. For now it has to be called in full e.g. bin/nutch org.apache.nutch.indexer.IndexingJob ... which will make sense when we have other indexing backends and not just SOLR. Think about 'nutch solrindex' as a shortcut for the generic command. 
Pluggable indexing backends
---------------------------

Key: NUTCH-1047
URL: https://issues.apache.org/jira/browse/NUTCH-1047
Project: Nutch
Issue Type: New Feature
Components: indexer
Reporter: Julien Nioche
Assignee: Julien Nioche
Labels: indexing
Fix For: 1.7
Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch, NUTCH-1047-1.x-v3.patch, NUTCH-1047-1.x-v4.patch

One possible feature would be to add a new endpoint for indexing-backends and make the indexing pluggable. At the moment we are hardwired to SOLR, which is OK, but as other resources like ElasticSearch are becoming more popular it would be better to handle this as plugins.

Not sure about the name of the endpoint though: we already have indexing-plugins (which are about generating the fields sent to the backends), and moreover the backends are not necessarily for indexing / searching but could be just an external storage, e.g. CouchDB. The term backend on its own would be confusing in 2.0, as it could refer to the storage in GORA. 'indexing-backend' is the best name that has come to my mind so far; please suggest better ones.

We should come up with generic map/reduce jobs for indexing, deduplicating and cleaning, and maybe add a Nutch extension point there so we can easily hook up indexing, cleaning and deduplicating for various backends.
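The design proposed in the issue description can be sketched as follows. This is only an illustration under stated assumptions: the names IndexBackend, IndexBackends and InMemoryBackend are hypothetical and are not the interface introduced by the patch. The point is that a generic job talks to one dispatcher, which forwards each call to whichever backends are active, so that SOLR, ElasticSearch or a plain storage such as CouchDB could be swapped in without touching the job.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical extension point, one implementation per backend (not the actual Nutch API). */
interface IndexBackend {
    void open(Map<String, String> conf);  // each backend looks up the settings it needs
    void write(Map<String, String> doc);  // add or update one document (field name -> value)
    void delete(String key);              // remove one document by its key
    void close();                         // flush and release resources
}

/** Forwards every call to all registered backends, so jobs never name a concrete one. */
final class IndexBackends implements IndexBackend {
    private final List<IndexBackend> active = new ArrayList<>();

    void register(IndexBackend backend) { active.add(backend); }

    @Override public void open(Map<String, String> conf) { for (IndexBackend b : active) b.open(conf); }
    @Override public void write(Map<String, String> doc) { for (IndexBackend b : active) b.write(doc); }
    @Override public void delete(String key) { for (IndexBackend b : active) b.delete(key); }
    @Override public void close() { for (IndexBackend b : active) b.close(); }
}

/** Toy backend that keeps documents in memory, keyed by their "id" field. */
final class InMemoryBackend implements IndexBackend {
    final Map<String, Map<String, String>> store = new LinkedHashMap<>();

    @Override public void open(Map<String, String> conf) { store.clear(); }
    @Override public void write(Map<String, String> doc) { store.put(doc.get("id"), doc); }
    @Override public void delete(String key) { store.remove(key); }
    @Override public void close() { /* nothing to release for the in-memory toy */ }
}
```

In this sketch the same dispatcher could serve the indexing and cleaning jobs mentioned in the description: indexing calls write, cleaning calls delete. Deduplication is left out because the thread has not settled on how a generic version would work.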
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564187#comment-13564187 ]

Tejas Patil commented on NUTCH-1047:
------------------------------------

Hi Julien,

After the reply from @lufeng, I was able to perform indexing with the crawl command. Here is a summary of what I observed:

||solr.server.url in nutch-site.xml||-D in crawl command||Works?||
|no|no|RuntimeException: Missing SOLR URL|
|no|yes|yes|
|yes|no|yes|
|yes|yes|yes|

Note that I had to pass -solr and the SOLR URL every time; otherwise indexing was not invoked.
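The four cases in Tejas's table can be modelled with a small Java sketch. The class SolrUrlResolver is hypothetical, and the rule that a -D value takes precedence over nutch-site.xml is an assumption: the table only shows that either source is sufficient, not which one wins when both are set. The error message is the one reported in the table.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of where the SOLR URL may come from; not Nutch code. */
final class SolrUrlResolver {
    static final String KEY = "solr.server.url";

    /**
     * @param siteConf  properties as they would be read from nutch-site.xml
     * @param overrides properties given on the command line with -D
     * @return the effective SOLR URL
     */
    static String resolve(Map<String, String> siteConf, Map<String, String> overrides) {
        Map<String, String> merged = new HashMap<>(siteConf);
        merged.putAll(overrides);  // assumption: command-line values override the site file
        String url = merged.get(KEY);
        if (url == null || url.trim().isEmpty()) {
            // Row "no / no" of the table: nothing configured anywhere.
            throw new RuntimeException("Missing SOLR URL");
        }
        return url;
    }
}
```

Only the first row of the table fails in this model, which matches the observation. It does not capture the bug under discussion, where a value set by the crawl command is lost before the job runs.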
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564196#comment-13564196 ]

Julien Nioche commented on NUTCH-1047:
--------------------------------------

Hi Tejas

It will work every time you set it in nutch-site.xml. As for setting it with -D in the crawl command: you definitely should not have to do that, and this is where the bug is. The problem is that the value we take from the crawl command is correctly set in the configuration object, but for some reason the latter is reloaded or overridden during the call to JobClient.runJob(job) (IndexingJob line 120).

BTW the crawl command is deprecated and should be removed at some point, as we have the crawl script. Could you try using the solrindex command as well as the crawl script while I try to solve the problem with the crawl command?

Thanks

Julien
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564252#comment-13564252 ]

Tejas Patil commented on NUTCH-1047:
------------------------------------

Hi Julien,

The solrindex command and the crawl script work fine after setting solr.server.url in nutch-site.xml. I did not use the -D option during these runs.
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564263#comment-13564263 ]

Julien Nioche commented on NUTCH-1047:
--------------------------------------

Tejas

The crawl script and the solrindex command should work without setting solr.server.url in nutch-site.xml or using -D, as this is handled for you in the nutch script. Can you please test without specifying solr.server.url in nutch-site.xml?

Thanks
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564827#comment-13564827 ]

Sebastian Nagel commented on NUTCH-1047:
----------------------------------------

As a test of the interface, I started to implement a CSV indexer, useful for exporting crawled data or for quick analysis. A first working version (a draft, still a lot to do) took 100+ lines of code: +1 for the interface / extension point.

Some concerns about the usability of IndexingJob as a daily tool:
- it is not really transparent which indexer is run (solr, elastic, etc.): you have to look into the property plugin.includes
- options must be passed to indexer plugins as properties: this is complicated, and there is no help that lists the available properties
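The core of a CSV indexer such as the one Sebastian mentions is turning one document into one CSV line. The following Java sketch shows that step only. It is not his draft: the class CsvLineFormatter is hypothetical, the quoting follows the usual RFC 4180 convention (fields containing a comma, a double quote or a line break are quoted, and embedded quotes are doubled), and representing a document as a map of single-valued string fields is a simplifying assumption.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical helper that renders one document as a CSV line; not Nutch code. */
final class CsvLineFormatter {

    /** Quotes a value when it contains a separator, a quote or a line break. */
    static String escape(String value) {
        if (value == null) {
            return "";  // missing fields become empty cells
        }
        boolean needsQuotes = value.contains(",") || value.contains("\"")
                || value.contains("\n") || value.contains("\r");
        String escaped = value.replace("\"", "\"\"");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    /** Emits the requested fields in a fixed column order, so every line has the same shape. */
    static String format(List<String> fields, Map<String, String> doc) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                line.append(',');
            }
            line.append(escape(doc.get(fields.get(i))));
        }
        return line.toString();
    }
}
```

The column list is passed in explicitly because documents may carry different sets of fields; in a real plugin it would presumably come from a configuration property, which is exactly the usability concern raised in the comment.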
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564076#comment-13564076 ]

lufeng commented on NUTCH-1047:
-------------------------------

Hi Tejas

Maybe you did not add the -D option to the bin/nutch crawl command; it is used to set the solr.server.url parameter. As for the "unknown field content" error, the cause may be that the SOLR schema.xml is not configured correctly. Did you copy conf/schema.xml from the Nutch conf directory to the example/solr/conf directory?
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564089#comment-13564089 ]

lufeng commented on NUTCH-1047:
-------------------------------

Hi Julien,

I found that bin/nutch contains a line like this: CLASS=org.apache.nutch.indexer.IndexingJob -D solr.server.url=$1. But I don't know why we do not add an option to set IndexerUrl, such as bin/nutch solrindex -indexurl http://localhost:8983/solr/. I have now found that the correct command to invoke the IndexingJob is bin/nutch solrindex http://localhost:8983/solr/ crawldb/ segments/20130121115214/ -filter. :(
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564095#comment-13564095 ]

Tejas Patil commented on NUTCH-1047:
------------------------------------

Hi Lufeng,

You are right. There was a problem with my schema.xml file. I corrected it and now things are working. Thanks!!
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13562558#comment-13562558 ]

Julien Nioche commented on NUTCH-1047:
--------------------------------------

Hi Lufeng.

The solrindex command in the nutch script works just as before. You can also invoke the IndexingJob command and pass it the SOLR URL as a Hadoop parameter, e.g. {{-D solr.server.url=xx}}

SolrUtils is duplicated indeed because of DeleteDuplicates, which is a SOLR-specific implementation. We need to build a generic deduplicator at some point and it will use the pluggable backends. I decided to leave the SOLR-based one in for now, but if most people don't use it then we should probably shelve it. This is a separate issue though.

Thanks for your comments
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13562369#comment-13562369 ]

lufeng commented on NUTCH-1047:
-------------------------------

Hi,

I applied the patch, but I could not find how to set the SOLR URI. Also, the class SolrUtils is duplicated in two places; maybe DeleteDuplicates will later become pluggable in the backends too.
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13557322#comment-13557322 ]

Markus Jelsma commented on NUTCH-1047:
--------------------------------------

Excellent work my friend! I'll be sure to test this next week! Hopefully it all works out fine and I can rewrite the other indexing patches with ease.

Cheers!
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556026#comment-13556026 ]

Julien Nioche commented on NUTCH-1047:
--------------------------------------

Good point Markus, thanks.

The main issue I am struggling with at the moment is what to do with the SOLR deduplication. I don't think we can run a MapReduce job from a plugin, so it's not going to work. One (temporary) option would be to leave it as is, so that the crawl command, the crawl script and the nutch command all work as expected, and then get rid of it when we have a generic deduplication job.
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556039#comment-13556039 ]

Markus Jelsma commented on NUTCH-1047:
--------------------------------------

I had an issue with dedup too in NUTCH-1480; unless we do something about it I cannot commit that. Personally I'd prefer to never touch that class again but keep it as legacy. What do you think?
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556041#comment-13556041 ]

Julien Nioche commented on NUTCH-1047:
--------------------------------------

We definitely need a better mechanism for deduplication. +1 to leave it as is for now until we have a better option. Slightly annoying for this issue is that it means adding it back to the main classes, as well as SOLR as a dependency; not a big deal though.
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556052#comment-13556052 ]

Markus Jelsma commented on NUTCH-1047:
--------------------------------------

Alright, I'll skip dedup for NUTCH-1480 and see if I can send it in and work on NUTCH-1377. Are you sure you cannot run a MapReduce program from within a plugin? I think it's worth trying :)
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556054#comment-13556054 ] Julien Nioche commented on NUTCH-1047:
Tried, failed. Re other issues: wouldn't it make sense to do NUTCH-1047 first, before you improve the SOLR backends?
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556075#comment-13556075 ] Markus Jelsma commented on NUTCH-1047:
Too bad. I'm not sure; at least NUTCH-1480 is ready, but fine by me. Too bad I'll have to rewrite the patches then ;)
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556079#comment-13556079 ] Julien Nioche commented on NUTCH-1047:
It should not be a big deal, as the classes affected by NUTCH-1480 are not modified that much by NUTCH-1047. It also means that you'll get to look at the code for this issue, which is a good way of reviewing it :-)
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556084#comment-13556084 ] Markus Jelsma commented on NUTCH-1047:
{quote}which is a good way of reviewing it{quote}
Cheers! Looking forward to your new patch.
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556091#comment-13556091 ] Julien Nioche commented on NUTCH-1047:
My suggestion was that you give NUTCH-1047 a try, wait until it is committed, and then commit your changes to it, not that I'd patch it to include your changes. BTW, I have commented on NUTCH-1480.
Thanks,
Julien
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556096#comment-13556096 ] Markus Jelsma commented on NUTCH-1047:
No, I understood correctly :)
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1376#comment-1376 ] Markus Jelsma commented on NUTCH-1047:
Very nice, Julien! Can you also add update() to the writer interface? See NUTCH-1506. Some implementations can do this, for example Solr as of recent commits. Other implementations can defer to add() if applicable, or throw UnsupportedOperationException.
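The update() fallback suggested in this comment could look roughly like the following. This is only a minimal sketch under assumptions: SketchIndexWriter, AppendingWriter and WriteOnceWriter are hypothetical names, the String-based signatures are simplified stand-ins rather than the actual Nutch writer interface, and the Java 8 default method is used purely as a convenience for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified writer contract; not the actual Nutch interface. */
interface SketchIndexWriter {
    void add(String id, String content);

    /** Backends without native update support fall back to a plain add. */
    default void update(String id, String content) {
        add(id, content);
    }
}

/** Relies on the default behaviour: update() defers to add(). */
class AppendingWriter implements SketchIndexWriter {
    final List<String> log = new ArrayList<>();

    @Override
    public void add(String id, String content) {
        log.add("add:" + id);
    }
}

/** A backend where an update makes no sense, so it refuses explicitly. */
class WriteOnceWriter implements SketchIndexWriter {
    final List<String> log = new ArrayList<>();

    @Override
    public void add(String id, String content) {
        log.add("add:" + id);
    }

    @Override
    public void update(String id, String content) {
        throw new UnsupportedOperationException("updates are not supported by this backend");
    }
}
```

Whether silently deferring to add() or failing loudly is the better default is a design choice for the real interface; the sketch shows both options side by side.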
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13163481#comment-13163481 ] Julien Nioche commented on NUTCH-1047:
bq. If you'd need WARC files, for some reason, i'd rather have an endpoint for it just like for ES and Solr instead of using WARC files as an intermediate format.
bq. Does your suggestion imply: segment+crawldb -> warc files -> search engine?
Nope, let's start again :-) We mentioned in this issue that we'd like to make the indexing backends pluggable, in order to simplify the code and make it easier for others to implement alternative backends. We currently have only SOLR; ES is clearly a good candidate, and you've rightly pointed out that we could have an XML dump of the docs. I would add that we could plug in JDBC or HBase etc. WARC is just another example of something we could have as a plugin.
The question was: is there a functional difference between, say, [XML|WARC] and [SOLR|ES]? For instance, the plugin endpoint for SOLR|ES would need to handle deletions, but not the XML or WARC one. Are there any more such differences? Is it an index vs dump issue? A remote vs local one? Would it make sense to have on one hand an indexer with plugins supporting deletions and expecting a URL, and on the other a separate job for converting segments and crawldb to XML, WARC etc.? Does it make more sense?
Pluggable indexing backends
---
Key: NUTCH-1047
URL: https://issues.apache.org/jira/browse/NUTCH-1047
Project: Nutch
Issue Type: New Feature
Components: indexer
Reporter: Julien Nioche
Assignee: Julien Nioche
Labels: indexing
Fix For: 1.5
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13163515#comment-13163515 ] Markus Jelsma commented on NUTCH-1047:
Ah yes, it makes sense now! If you look at the patch for NUTCH-1139 you can see that the endpoint, Solr in this case, implements the delete method as called from NutchIndexAction. Another endpoint could simply ignore deletions and do nothing but write out WARC or Solr XML files.
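The split described in this comment, where an index backend honours deletions while a dump backend ignores them, might be sketched as follows. All names here (SketchAction, SketchBackend, InMemoryIndexBackend, DumpBackend, SketchDispatcher) are hypothetical and only loosely modelled on the idea of an action object driving the endpoint; none of this is taken from the NUTCH-1139 patch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical action type: what the indexing job wants done with a document. */
enum SketchAction { ADD, DELETE }

/** Hypothetical backend contract with both operations. */
interface SketchBackend {
    void write(String url, String content);
    void delete(String url);
}

/** Stands in for a search server such as Solr or ES: deletions must be applied. */
class InMemoryIndexBackend implements SketchBackend {
    final Map<String, String> docs = new LinkedHashMap<>();

    @Override
    public void write(String url, String content) {
        docs.put(url, content);
    }

    @Override
    public void delete(String url) {
        docs.remove(url);
    }
}

/** Stands in for a WARC or XML dump: records are appended, deletions are ignored. */
class DumpBackend implements SketchBackend {
    final List<String> records = new ArrayList<>();

    @Override
    public void write(String url, String content) {
        records.add(url + "\t" + content);
    }

    @Override
    public void delete(String url) {
        // Intentionally a no-op: a dump has nothing to remove.
    }
}

/** Routes each action to the matching backend method. */
class SketchDispatcher {
    static void apply(SketchBackend backend, SketchAction action, String url, String content) {
        if (action == SketchAction.DELETE) {
            backend.delete(url);
        } else {
            backend.write(url, content);
        }
    }
}
```

With this shape a single endpoint covers both kinds of backend, at the cost of dump-style plugins carrying an empty delete method.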
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13163704#comment-13163704 ] Julien Nioche commented on NUTCH-1047:
The classes NutchIndexWriter and NutchIndexWriterFactory already provide us with the type of abstraction we need. We could turn the interface NutchIndexWriter into an endpoint and add the methods we need (e.g. delete). What is not clear yet is what IndexerOutputFormat is used for, and whether we will be able to use implementations of NutchIndexWriter from within a plugin.
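One way to picture a writer interface turned into an endpoint is a registry of named implementations plus a writer that fans every call out to the active ones. The sketch below is an assumption-laden illustration, not the Nutch plugin mechanism: PluggableWriter, RecordingWriter, WriterRegistry and FanOutWriter are hypothetical names, and the ids "solr" and "dump" used with them are placeholders.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical endpoint contract, including the delete method discussed above. */
interface PluggableWriter {
    void write(String url, String content);
    void delete(String url);
    void close();
}

/** Test double that records every call it receives under its own name. */
class RecordingWriter implements PluggableWriter {
    private final String name;
    private final List<String> calls;

    RecordingWriter(String name, List<String> calls) {
        this.name = name;
        this.calls = calls;
    }

    @Override public void write(String url, String content) { calls.add(name + ":write:" + url); }
    @Override public void delete(String url) { calls.add(name + ":delete:" + url); }
    @Override public void close() { calls.add(name + ":close"); }
}

/** Maps plugin ids to factories; stands in for a real extension-point lookup. */
class WriterRegistry {
    private final Map<String, Supplier<PluggableWriter>> factories = new LinkedHashMap<>();

    void register(String id, Supplier<PluggableWriter> factory) {
        factories.put(id, factory);
    }

    /** Instantiates the writers named in the (hypothetical) configuration, in order. */
    List<PluggableWriter> activate(List<String> ids) {
        List<PluggableWriter> writers = new ArrayList<>();
        for (String id : ids) {
            Supplier<PluggableWriter> factory = factories.get(id);
            if (factory == null) {
                throw new IllegalArgumentException("no writer plugin registered as " + id);
            }
            writers.add(factory.get());
        }
        return writers;
    }
}

/** Forwards each operation to every active writer. */
class FanOutWriter implements PluggableWriter {
    private final List<PluggableWriter> delegates;

    FanOutWriter(List<PluggableWriter> delegates) {
        this.delegates = delegates;
    }

    @Override public void write(String url, String content) {
        for (PluggableWriter w : delegates) w.write(url, content);
    }
    @Override public void delete(String url) {
        for (PluggableWriter w : delegates) w.delete(url);
    }
    @Override public void close() {
        for (PluggableWriter w : delegates) w.close();
    }
}
```

The job code would then only ever talk to the fan-out writer, which leaves open the question raised in the comment of whether such implementations can actually be loaded from within a plugin.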
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162704#comment-13162704 ] Julien Nioche commented on NUTCH-1047:
It would be nice to have a plugin implementing this endpoint to generate WARC files. There seem to be two different situations though: one where we send docs to servers (SOLR, ES) and one where we generate files. Do we need to handle deletions for the latter? I don't think so, but we would need to for the former. Any thoughts on this? Would it make sense to have 2 different endpoints or not?
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162977#comment-13162977 ] Markus Jelsma commented on NUTCH-1047:
Hi Julien, I'm not sure I get your point exactly, but if we don't generate WARC files we:
- don't have to think about the problem you state
- don't create an additional process between Nutch and a search engine
If you'd need WARC files, for some reason, I'd rather have an endpoint for it, just like for ES and Solr, instead of using WARC files as an intermediate format. Does your suggestion imply: segment+crawldb -> warc files -> search engine?
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066484#comment-13066484 ] Julien Nioche commented on NUTCH-1047:
{quote} My interest in your last point is a question which I suppose is wide open to discussion. What end-points (generally speaking) are we going to support and formally represent as pluggable entities? What criteria do we make decisions based on? {quote}
We'll simply port the existing SOLR indexing to the plugin-based architecture so that people can easily add the backends they need. If there is a widespread need for a specific backend then I suppose someone will contribute patches and it might get committed. It's not like we need to define which backends (not the same as endpoints, BTW) would be added etc. We are just giving people the possibility of simply adding theirs without having to do a dirty hack of the indexer. There is currently a growing interest in ElasticSearch, and I know of at least one person who has modified the SOLR indexer to get it to work for ES. This would be a good candidate for inclusion; apart from that, let's see what people contribute.
Pluggable indexing backends
---
Key: NUTCH-1047
URL: https://issues.apache.org/jira/browse/NUTCH-1047
Project: Nutch
Issue Type: Improvement
Components: indexer
Affects Versions: 1.4
Reporter: Julien Nioche
Labels: indexing
Fix For: 1.4