[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-03-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13595806#comment-13595806
 ] 

Hudson commented on NUTCH-1047:
---

Integrated in Nutch-trunk-Windows #57 (See 
[https://builds.apache.org/job/Nutch-trunk-Windows/57/])
NUTCH-1047 Pluggable indexing backends (Revision 1453776)

 Result = FAILURE
jnioche : http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1453776
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/build.xml
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/default.properties
* /nutch/trunk/ivy/ivy.xml
* /nutch/trunk/src/bin/nutch
* /nutch/trunk/src/java/org/apache/nutch/crawl/Crawl.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/CleaningJob.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexWriter.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexWriters.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingJob.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/NutchIndexWriter.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/NutchIndexWriterFactory.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrClean.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrMappingReader.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrUtils.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrWriter.java
* /nutch/trunk/src/plugin/build.xml
* /nutch/trunk/src/plugin/indexer-solr
* /nutch/trunk/src/plugin/indexer-solr/build.xml
* /nutch/trunk/src/plugin/indexer-solr/ivy.xml
* /nutch/trunk/src/plugin/indexer-solr/plugin.xml
* /nutch/trunk/src/plugin/indexer-solr/src
* /nutch/trunk/src/plugin/indexer-solr/src/java
* /nutch/trunk/src/plugin/indexer-solr/src/java/org
* /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache
* /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch
* /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter
* 
/nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr
* 
/nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrConstants.java
* 
/nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java
* 
/nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrMappingReader.java
* 
/nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrUtils.java
* /nutch/trunk/src/plugin/nutch-extensionpoints/plugin.xml


 Pluggable indexing backends
 ---

 Key: NUTCH-1047
 URL: https://issues.apache.org/jira/browse/NUTCH-1047
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Julien Nioche
Assignee: Julien Nioche
  Labels: indexing
 Fix For: 1.7

 Attachments: NUTCH-1047-1.x-final.patch, NUTCH-1047-1.x-v1.patch, 
 NUTCH-1047-1.x-v2.patch, NUTCH-1047-1.x-v3.patch, NUTCH-1047-1.x-v4.patch, 
 NUTCH-1047-1.x-v5.patch


 One possible feature would be to add a new endpoint for indexing-backends and 
 make the indexing pluggable. At the moment we are hardwired to SOLR - which is 
 OK - but as other resources like ElasticSearch are becoming more popular it 
 would be better to handle this as plugins. Not sure about the name of the 
 endpoint though : we already have indexing-plugins (which are about 
 generating fields sent to the backends) and moreover the backends are not 
 necessarily for indexing / searching but could be just an external storage 
 e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this 
 could be pertaining to the storage in GORA. 'indexing-backend' is the best 
 name that came to my mind so far - please suggest better ones.
 We should come up with generic map/reduce jobs for indexing, deduplicating 
 and cleaning and maybe add a Nutch extension point there so we can easily 
 hook up indexing, cleaning and deduplicating for various backends.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-03-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13595809#comment-13595809
 ] 

Hudson commented on NUTCH-1047:
---

Integrated in Nutch-trunk #2144 (See 
[https://builds.apache.org/job/Nutch-trunk/2144/])
NUTCH-1047 Pluggable indexing backends (Revision 1453776)

 Result = SUCCESS
jnioche : http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1453776
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/build.xml
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/default.properties
* /nutch/trunk/ivy/ivy.xml
* /nutch/trunk/src/bin/nutch
* /nutch/trunk/src/java/org/apache/nutch/crawl/Crawl.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/CleaningJob.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexWriter.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexWriters.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingJob.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/NutchIndexWriter.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/NutchIndexWriterFactory.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrClean.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrMappingReader.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrUtils.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrWriter.java
* /nutch/trunk/src/plugin/build.xml
* /nutch/trunk/src/plugin/indexer-solr
* /nutch/trunk/src/plugin/indexer-solr/build.xml
* /nutch/trunk/src/plugin/indexer-solr/ivy.xml
* /nutch/trunk/src/plugin/indexer-solr/plugin.xml
* /nutch/trunk/src/plugin/indexer-solr/src
* /nutch/trunk/src/plugin/indexer-solr/src/java
* /nutch/trunk/src/plugin/indexer-solr/src/java/org
* /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache
* /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch
* /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter
* 
/nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr
* 
/nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrConstants.java
* 
/nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java
* 
/nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrMappingReader.java
* 
/nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrUtils.java
* /nutch/trunk/src/plugin/nutch-extensionpoints/plugin.xml




[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-03-07 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13596042#comment-13596042
 ] 

Lewis John McGibbney commented on NUTCH-1047:
-

Nice work, Julien.



[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-03-06 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13595252#comment-13595252
 ] 

Sebastian Nagel commented on NUTCH-1047:


Hi Julien,

Overall, everything looks good. A first version of the CSV indexer is ready 
(NUTCH-1541) and works well with the latest v5 patch.

One point we should improve is the command-line help. I agree with Tejas that 
the help should list all required arguments. Of course, you are right that the 
index/cleaning jobs are backend-neutral, but then it would be preferable to 
have new commands index and indexclean. They are also required if other 
indexer back-ends are used. We can keep the solr* commands for legacy and 
because they are handy. A few additional lines to generate the prior help text 
are tolerable and could avoid unnecessary user requests on the mailing list.

The describe() method is a good idea. The new commands will then show 
sufficient help but IndexingJob/CleaningJob should also call describe() when 
help is shown!

Some trivialities to get the Java docs right:
* default.properties - need to add the new plugins.indexer group with 
indexer-solr as member
* build.xml - add group referring to plugins.indexer, add Java doc targets 
for indexer-solr
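
The command layout suggested above (generic index and indexclean commands, with the solr* commands kept as legacy aliases) could be sketched roughly as follows. This is only an illustration: the function name resolve_class is hypothetical, and the real bin/nutch script may organise its dispatch differently. The class names are the ones listed in the commit.

```shell
# Sketch only: maps a bin/nutch sub-command to the Java class it would run.
# "index" and "indexclean" are the proposed generic commands; the solr*
# names are kept as legacy aliases for the same backend-neutral jobs.
# resolve_class is a hypothetical helper, not part of bin/nutch.
resolve_class() {
  case "$1" in
    index|solrindex)
      echo "org.apache.nutch.indexer.IndexingJob" ;;
    indexclean|solrclean)
      echo "org.apache.nutch.indexer.CleaningJob" ;;
    *)
      echo "Unknown command: $1" >&2
      return 1 ;;
  esac
}
```

Keeping both names in one case branch means the legacy and generic commands cannot drift apart.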




[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-02-21 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13583011#comment-13583011
 ] 

Tejas Patil commented on NUTCH-1047:


Hi Julien,

One small change in the Java class would be to display this usage message to 
the user:
{noformat}$ bin/nutch solrclean 
Usage: CleaningJob crawldb solrurl [-noCommit]{noformat}

The current patch doesn't display solrurl in the usage.



[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-02-21 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13583111#comment-13583111
 ] 

Julien Nioche commented on NUTCH-1047:
--

Tejas,

The CleaningJob is backend-neutral and as such should not expect solrurl as a 
parameter. The same goes for the IndexingJob.



[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-02-20 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13582034#comment-13582034
 ] 

Julien Nioche commented on NUTCH-1047:
--

Hi Tejas

Thank you for taking the time to have a look. The SolrClean command has been 
modified too to use the plugin architecture and that should be the last thing I 
think.

Thanks

Julien 



[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-02-20 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13582163#comment-13582163
 ] 

Tejas Patil commented on NUTCH-1047:


Hey Julien,

While running the solrclean command, I followed the old usage given here [0]. 
It threw an exception. Then I checked the usage, which printed
{noformat}$ bin/nutch solrclean 
Usage: CleaningJob crawldb [-noCommit]{noformat}

That did not work either. It just prints the usage if only the crawldb is 
passed as an argument. I went through the patch and realized that the bin/nutch 
script treats the first argument as the Solr URL and passes the rest, i.e. the 
crawldb, to the Java code. This is what worked for me:
{noformat}bin/nutch solrclean solrurl crawldb{noformat}

This is different from the old usage given at [0]. We can avoid changing the 
order of the arguments and preserve the old usage. This can be used in the 
bin/nutch script:
{noformat}CLASS=org.apache.nutch.indexer.CleaningJob -D 
solr.server.url=$2{noformat} without performing a shift after that. 
The corresponding usage must be modified in the Java code too.

[0] : http://wiki.apache.org/nutch/bin/nutch%20solrclean



[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-02-20 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13582183#comment-13582183
 ] 

Julien Nioche commented on NUTCH-1047:
--

Hi Tejas

Good catch, could do 

{color:red} 
  CLASS=org.apache.nutch.indexer.CleaningJob -D solr.server.url=$2 $1
  shift; shift
{color}

No change is needed in the Java code, as it expects only one argument, which 
is the crawldb. We could also get the CleaningJob to log which indexers are 
available.

Re solrdedup: the explanation is given earlier in this thread. It is a 
SOLR-specific approach and we can't run a job located in a plugin. The main job 
file has to be in the core code. We need a better deduplicator anyway.
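
The argument handling discussed here might look like the sketch below. It assumes the legacy order (crawldb first, then the Solr URL) and a hypothetical helper named solrclean_cmdline; it only builds the command line as text and is not the code from the patch.

```shell
# Sketch only: keeps the legacy argument order
#   solrclean <crawldb> <solrurl> [-noCommit]
# while handing the Solr URL to the backend-neutral CleaningJob as a
# configuration property. solrclean_cmdline is a hypothetical name.
solrclean_cmdline() {
  if [ "$#" -lt 2 ]; then
    echo "Usage: solrclean <crawldb> <solrurl> [-noCommit]" >&2
    return 1
  fi
  crawldb=$1
  solrurl=$2
  shift; shift   # drop both consumed arguments; only options remain
  cmdline="org.apache.nutch.indexer.CleaningJob -D solr.server.url=$solrurl $crawldb"
  if [ "$#" -gt 0 ]; then
    cmdline="$cmdline $*"   # append remaining options such as -noCommit
  fi
  echo "$cmdline"
}
```

The Java side then sees only the crawldb and any trailing options, which matches the single-argument expectation described above.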

 



[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-02-19 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13581910#comment-13581910
 ] 

lufeng commented on NUTCH-1047:
---

Patch v5 works correctly in Nutch 1.6 with Solr 3.6 and 4.1. For Solr 4.1, the 
configuration file schema-solr4.xml had the patch from 
[NUTCH-1486|https://issues.apache.org/jira/browse/NUTCH-1486] applied. 

It would be better if the index job could report progress. 

Good job, thanks Julien. 





[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-02-19 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13581964#comment-13581964
 ] 

Tejas Patil commented on NUTCH-1047:


Hi Julien,

The crawl command (with solr option) and solrindex command are working properly 
now :) Is there anything else that you think must be verified ?



[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-29 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13565202#comment-13565202
 ] 

Tejas Patil commented on NUTCH-1047:


Hi Julien,

As you suggested, I tried to run the solrindex command without setting 
solr.server.url in nutch-site.xml or via -D.
 
Command used: {noformat}bin/nutch solrindex http://localhost:8983/solr 
mycrawl/crawldb/ mycrawl/segments/201301280439/{noformat}

It says: {noformat}
Usage: Indexer crawldb [-linkdb linkdb] [-params k1=v1&k2=v2...] (segment 
... | -dir segments) [-noCommit] [-deleteGone] [-filter] 
[-normalize]{noformat}

The check on the number of arguments is causing this. I corrected it locally 
and it worked fine after that.
According to the usage above, the user needs to provide just the crawldb and a 
segment. But the user also has to pass the Solr URL, which is consumed by the 
bin/nutch script. The usage message must be changed to hide this mechanism from 
the user.
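
One way to picture the mismatch: the user types three or more arguments, but the script consumes the first one, so a check written against the Java-side usage counts one too few. The sketch below validates the arguments as the user sees them. The function name check_solrindex_args and the exact usage wording are assumptions for illustration, and it checks only the argument count, not the option syntax.

```shell
# Sketch only: validates arguments as typed by the user, i.e.
#   solrindex <solrurl> <crawldb> [options] (<segment> ... | -dir <segments>)
# The wrapper consumes <solrurl>, so the usage text shown to the user should
# mention it even though the Java job never sees it.
# check_solrindex_args is a hypothetical helper.
check_solrindex_args() {
  # Minimum: solrurl, crawldb and at least one segment (or -dir <segments>).
  if [ "$#" -lt 3 ]; then
    echo "Usage: solrindex <solrurl> <crawldb> [-linkdb <linkdb>] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] [-filter] [-normalize]" >&2
    return 1
  fi
  return 0
}
```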



[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-28 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564173#comment-13564173
 ] 

Julien Nioche commented on NUTCH-1047:
--

@tejasp I can reproduce the issue and am looking into it, thanks. Somehow the 
configuration does not get passed on properly when using the crawl command.

Lufeng 
{quote}
But i don't know why not add an option to set IndexerUrl such as bin/nutch 
solrindex -indexurl http://localhost:8983/solr/.
{quote}

Whether it is passed as a parameter or via the configuration should not make 
much of a difference. Your suggestion also assumes that the indexing backend 
can be reached via a single URL, which is not necessarily the case: it might 
not need a URL at all or, on the contrary, might need multiple URLs. Better to 
leave that logic in the configuration and assume that the backends will find 
whatever they need there.

{quote}
 the correct command to invoke the IndexingJob command is bin/nutch solrindex 
http://localhost:8983/solr/ crawldb/ segments/20130121115214/ -filter.
{quote}

As explained above, we want to keep compatibility with the existing solrindex 
command and not change its syntax. Underneath, it uses the new plugin-based 
code but sets the value of the SOLR config. There is no shortcut for the 
generic indexing job command in the nutch script yet, but we could add one. For 
now it has to be called in full, e.g. bin/nutch 
org.apache.nutch.indexer.IndexingJob ..., which will make sense when we have 
other indexing backends and not just SOLR.

Think about 'nutch solrindex' as a shortcut for the generic command.
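
The relationship between the shortcut and the generic command can be sketched 
roughly as follows. This is only an illustration in Java of the argument 
rewriting that the bin/nutch shell script performs, based on the script line 
quoted earlier in this thread; the class and method names are hypothetical and 
are not part of Nutch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not Nutch code: mimics how the bin/nutch script
// turns "solrindex <solr url> <args...>" into a call to the generic job.
class SolrIndexShortcut {
    static final String GENERIC_CLASS = "org.apache.nutch.indexer.IndexingJob";

    // The first argument is taken as the SOLR URL and handed over as a -D
    // property; every other argument is passed through unchanged.
    static List<String> toGenericCommand(String... shortcutArgs) {
        if (shortcutArgs.length < 1) {
            throw new IllegalArgumentException(
                "solrindex expects the SOLR URL as its first argument");
        }
        List<String> command = new ArrayList<>();
        command.add(GENERIC_CLASS);
        command.add("-D");
        command.add("solr.server.url=" + shortcutArgs[0]);
        command.addAll(Arrays.asList(shortcutArgs).subList(1, shortcutArgs.length));
        return command;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", toGenericCommand(
            "http://localhost:8983/solr/", "crawldb/",
            "segments/20130121115214/", "-filter")));
    }
}
```

With this rewriting the generic job only ever sees the crawldb, the segments 
and the flags, which is why its usage message does not mention the URL.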









[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-28 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564187#comment-13564187
 ] 

Tejas Patil commented on NUTCH-1047:


Hi Julien, 
After the reply from @lufeng, I was able to perform indexing with the crawl 
command. Here is a summary of what I have observed:
||solr.server.url in nutch-site.xml||-D in crawl command||Works ?|| 
|no|no|RuntimeException: Missing SOLR URL| 
|no|yes|yes|
|yes|no|yes|
|yes|yes|yes|

Note that I had to pass -solr and the Solr URL every time; otherwise indexing 
was not invoked.
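
The four rows of the table are consistent with a simple lookup of 
solr.server.url that fails only when neither source provides a value. The Java 
sketch below illustrates that reading; it is not the Nutch implementation, the 
class name is hypothetical, and letting the -D value win when both are set is 
an assumption that the table itself does not show.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the lookup suggested by the table; not Nutch code.
class SolrUrlLookup {
    static final String KEY = "solr.server.url";

    // siteConfig stands in for nutch-site.xml, dashDOverrides for -D options.
    static String resolve(Map<String, String> siteConfig,
                          Map<String, String> dashDOverrides) {
        Map<String, String> effective = new HashMap<>(siteConfig);
        // Assumption: a value given with -D replaces the one from the site file.
        effective.putAll(dashDOverrides);
        String url = effective.get(KEY);
        if (url == null || url.isEmpty()) {
            // Same message as in the first row of the table.
            throw new RuntimeException("Missing SOLR URL");
        }
        return url;
    }

    public static void main(String[] args) {
        Map<String, String> site =
            Collections.singletonMap(KEY, "http://localhost:8983/solr");
        Map<String, String> none = Collections.emptyMap();
        System.out.println(resolve(site, none));
    }
}
```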



[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-28 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564196#comment-13564196
 ] 

Julien Nioche commented on NUTCH-1047:
--

Hi Tejas

It will work every time you set it in nutch-site.xml. As for setting it with -D 
in the crawl command, you definitely should not have to do that, and this is 
where the bug is. The problem is that the value we take from the crawl command 
is correctly set in the configuration object, but for some reason the latter is 
reloaded or overridden during the call to JobClient.runJob(job) (IndexingJob 
line 120).

BTW the crawl command is deprecated and should be removed at some point as we 
have the crawl script. Could you try using the solrindex command as well as the 
crawl script while I try to solve the problem with the crawl command?

Thanks

Julien





[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-28 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564252#comment-13564252
 ] 

Tejas Patil commented on NUTCH-1047:


Hi Julien, the solrindex command and the crawl script work fine after setting 
solr.server.url in nutch-site.xml. I did not use the -D option during these 
runs.



[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-28 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564263#comment-13564263
 ] 

Julien Nioche commented on NUTCH-1047:
--

Tejas

The crawl script and the solrindex command should work without setting 
solr.server.url in nutch-site.xml or using -D, as this is handled for you in 
the nutch script. Can you please test without specifying solr.server.url in 
nutch-site.xml?

Thanks



[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-28 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564827#comment-13564827
 ] 

Sebastian Nagel commented on NUTCH-1047:


As a test of the interface I started to implement a CSV indexer, useful for 
exporting crawled data or for quick analysis. A first working version (a draft, 
still a lot to do) took just over 100 lines of code: +1 for the interface / 
extension point.

Some concerns about the usability of IndexingJob as a daily tool:
- it's not really transparent which indexer is run (solr, elastic, etc.): you 
have to look into the property plugin.includes
- options must be passed to indexer plugins as properties: this is complicated, 
and there is no help to get a list of available properties
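
To give an idea of how small such a backend can be, here is a self-contained 
Java sketch of a CSV writer behind a pluggable interface. It is not the draft 
mentioned above and does not use the real Nutch IndexWriter extension point: 
the BackendWriter interface, the CsvBackendWriter class and the 
map-of-strings document model are all simplified, hypothetical stand-ins.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for an indexing-backend extension point.
interface BackendWriter {
    void write(Map<String, String> doc);
    void close();
}

// Writes one CSV row per document, with a fixed list of fields as columns.
class CsvBackendWriter implements BackendWriter {
    private final PrintWriter out;
    private final List<String> fields;

    CsvBackendWriter(PrintWriter out, List<String> fields) {
        this.out = out;
        this.fields = fields;
        // Header row with the field names.
        Map<String, String> header = new HashMap<>();
        for (String field : fields) {
            header.put(field, field);
        }
        write(header);
    }

    // Quote every value and double any embedded quote characters.
    static String quote(String value) {
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    @Override
    public void write(Map<String, String> doc) {
        StringBuilder row = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                row.append(',');
            }
            // Fields missing from the document become empty cells.
            row.append(quote(doc.getOrDefault(fields.get(i), "")));
        }
        out.print(row.append('\n'));
    }

    @Override
    public void close() {
        out.close();
    }

    public static void main(String[] args) {
        StringWriter buffer = new StringWriter();
        BackendWriter writer =
            new CsvBackendWriter(new PrintWriter(buffer), Arrays.asList("url", "title"));
        Map<String, String> doc = new HashMap<>();
        doc.put("url", "http://example.org/");
        doc.put("title", "Example page");
        writer.write(doc);
        writer.close();
        System.out.print(buffer);
    }
}
```

In this sketch the list of columns is a constructor argument; in a real plugin 
it would presumably arrive as a configuration property, which is exactly the 
usability concern raised in the second point above.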





[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-27 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564076#comment-13564076
 ] 

lufeng commented on NUTCH-1047:
---

Hi Tejas

Maybe you did not add the -D option to the bin/nutch crawl command. They are 
all used to set the solr.server.url parameter. The unknown field content error 
is probably caused by a Solr schema.xml that is not configured correctly. Did 
you copy conf/schema.xml from the Nutch conf directory to the example/solr/conf 
directory?




[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-27 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564089#comment-13564089
 ] 

lufeng commented on NUTCH-1047:
---

Hi Julien,

I found that bin/nutch contains a line like this: 
CLASS=org.apache.nutch.indexer.IndexingJob -D solr.server.url=$1. But I 
don't know why we don't add an option to set the indexer URL, such as bin/nutch 
solrindex -indexurl http://localhost:8983/solr/. 

But now I found that the correct command to invoke the IndexingJob command is 
bin/nutch solrindex http://localhost:8983/solr/ crawldb/ 
segments/20130121115214/ -filter. :(



[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-27 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564095#comment-13564095
 ] 

Tejas Patil commented on NUTCH-1047:


Hi Lufeng,

You are right. There was a problem with my schema.xml file. I corrected it and 
now things are working. Thanks !!



[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-25 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13562558#comment-13562558
 ] 

Julien Nioche commented on NUTCH-1047:
--

Hi Lufeng. 

The solrindex command in the nutch script works just as before. You can also 
invoke the IndexingJob command and pass it the SOLR URL as a Hadoop parameter 
e.g. {{-D solr.server.url=xx}}
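
For readers unfamiliar with this convention, the effect of a {{-D key=value}} 
pair can be pictured with the minimal Java imitation below. It is not Hadoop's 
actual option parser and handles only the space-separated form shown above; the 
class name DashDOptions is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Minimal imitation of "-D key=value" handling; not Hadoop's actual parser.
class DashDOptions {
    final Map<String, String> properties = new LinkedHashMap<>();
    final List<String> remainingArgs = new ArrayList<>();

    // Collects every "-D key=value" pair as a property and leaves all other
    // arguments, in their original order, for the job itself.
    static DashDOptions parse(String... args) {
        DashDOptions parsed = new DashDOptions();
        for (int i = 0; i < args.length; i++) {
            if (args[i].equals("-D") && i + 1 < args.length) {
                // Split on the first '=' only, so values may contain '='.
                String[] keyValue = args[++i].split("=", 2);
                if (keyValue.length != 2) {
                    throw new IllegalArgumentException(
                        "Expected key=value after -D but got: " + args[i]);
                }
                parsed.properties.put(keyValue[0], keyValue[1]);
            } else {
                parsed.remainingArgs.add(args[i]);
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        DashDOptions parsed = parse(
            "-D", "solr.server.url=http://localhost:8983/solr", "crawldb/");
        System.out.println(parsed.properties + " " + parsed.remainingArgs);
    }
}
```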

SolrUtils is duplicated indeed because of DeleteDuplicates, which is a 
SOLR-specific implementation. We need to build a generic deduplicator at some 
point and it will use the pluggable backends. I decided to leave the SOLR-based 
one in for now, but if most people don't use it then we should probably shelve 
it. This is a separate issue though.

Thanks for your comments



[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-24 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13562369#comment-13562369
 ] 

lufeng commented on NUTCH-1047:
---

Hi, I applied the patch, but I could not find how to set the Solr URI. Also, 
the class SolrUtils is duplicated in two places; maybe DeleteDuplicates will 
later become pluggable in the backends too.



[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-18 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13557322#comment-13557322
 ] 

Markus Jelsma commented on NUTCH-1047:
--

Excellent work my friend! I'll be sure to test this next week! Hopefully it all 
works out fine and I can rewrite the other indexing patches with ease.

Cheers!



[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-17 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556026#comment-13556026
 ] 

Julien Nioche commented on NUTCH-1047:
--

Good point Markus, thanks.
The main issue I am struggling with at the moment is what to do with the SOLR 
deduplication. I don't think we can run a MapReduce job from a plugin, so it's 
not going to work. One (temporary) option would be to leave it as is, so that 
the crawl command, the crawl script and the nutch command all work as expected, 
and then get rid of it when we have a generic deduplication job. 



[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-17 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556039#comment-13556039
 ] 

Markus Jelsma commented on NUTCH-1047:
--

I had an issue with dedup too in NUTCH-1480; unless we do something about it I 
cannot commit that. Personally I'd prefer to never touch that class again but 
keep it as legacy. What do you think? 



[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-17 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13556041#comment-13556041
 ] 

Julien Nioche commented on NUTCH-1047:
--

We definitely need a better mechanism for deduplication. +1 to leave it as is 
for now until we have a better option. The slightly annoying part for this 
issue is that it means adding it back to the main classes, with SOLR as a 
dependency; not a big deal though.



[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-17 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13556052#comment-13556052
 ] 

Markus Jelsma commented on NUTCH-1047:
--

Alright, I'll skip dedup for NUTCH-1480 and see if I can send it in and work on 
NUTCH-1377.

Are you sure you cannot run a MapReduce program from within a plugin? I think 
it's worth trying :)



[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-17 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13556054#comment-13556054
 ] 

Julien Nioche commented on NUTCH-1047:
--

Tried, failed. 
Re the other issues: wouldn't it make sense to do NUTCH-1047 first, before you 
improve the SOLR backends?



[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-17 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13556075#comment-13556075
 ] 

Markus Jelsma commented on NUTCH-1047:
--

Too bad.

I'm not sure; at least 1480 is ready, but fine by me. Too bad I'll have to 
rewrite the patches then ;)





[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-17 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13556079#comment-13556079
 ] 

Julien Nioche commented on NUTCH-1047:
--

Should not be a big deal as the classes affected by NUTCH-1480 are not modified 
that much by NUTCH-1047 and it also means that you'll get to look at the code 
for this issue which is a good way of reviewing it :-)



[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-17 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13556084#comment-13556084
 ] 

Markus Jelsma commented on NUTCH-1047:
--

{quote}which is a good way of reviewing it{quote}

Cheers! Looking forward to your new patch.



[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-17 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13556091#comment-13556091
 ] 

Julien Nioche commented on NUTCH-1047:
--

My suggestion was that you give NUTCH-1047 a try, wait until it is committed, 
then commit your changes on top of it; not that I'd patch it to include your 
changes.

BTW, I have commented on NUTCH-1480.

thanks

Julien



 



[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-17 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13556096#comment-13556096
 ] 

Markus Jelsma commented on NUTCH-1047:
--

No, I understood correctly :)




[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-16 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1376#comment-1376
 ] 

Markus Jelsma commented on NUTCH-1047:
--

Very nice, Julien! Can you also add update() to the writer interface? See 
NUTCH-1506. Some impls can do this, such as recent Solr commits. Other impls 
can defer to add() if applicable or throw an UnsupportedOperationException.
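
The update() suggestion above can be sketched roughly as follows. This is a
minimal illustration only: SketchIndexWriter, OverwritingWriter and
AppendOnlyWriter are hypothetical names, not classes from the NUTCH-1047
patch, and the default method assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical writer interface; method names are illustrative and are not
// taken from the NUTCH-1047 patch.
interface SketchIndexWriter {
  void add(String id, String content);

  // Backends with no native update can inherit this fallback to add().
  default void update(String id, String content) {
    add(id, content);
  }
}

// A backend that accepts the default: update() simply re-adds the document.
class OverwritingWriter implements SketchIndexWriter {
  final List<String> log = new ArrayList<>();

  @Override
  public void add(String id, String content) {
    log.add("add:" + id);
  }
}

// A backend that can only append, so it rejects updates outright.
class AppendOnlyWriter implements SketchIndexWriter {
  final List<String> log = new ArrayList<>();

  @Override
  public void add(String id, String content) {
    log.add("add:" + id);
  }

  @Override
  public void update(String id, String content) {
    throw new UnsupportedOperationException("update not supported");
  }
}
```

With this shape the caller always invokes update(), and each backend decides
whether that means a real update, a plain add(), or a refusal.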



[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2011-12-06 Thread Julien Nioche (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13163481#comment-13163481
 ] 

Julien Nioche commented on NUTCH-1047:
--

bq. If you'd need WARC files, for some reason, i'd rather have an endpoint for 
it just like for ES and Solr instead of using WARC files as an intermediate 
format.
bq. Does your suggestion imply: segment+crawldb -> warc files -> search engine? 

Nope, let's start again :-)
We mentioned in this issue that we'd like to make the indexing backends 
pluggable in order to simplify the code and make it easier for others to 
implement alternative backends. We currently have only SOLR, ES is clearly a 
good candidate and you've rightly pointed out that we could have a XML dump of 
the docs. I would add that we could plug in JDBC or HBase etc... WARC is just 
another example of something we could have as a plugin. 

The question was: is there a functional difference between, say, [XML|WARC] and 
[SOLR|ES]? For instance the plugin endpoint for SOLR|ES would need to handle 
deletions, but not the XML or WARC one. Are there any more such differences? Is 
it an index vs dump issue? A remote vs local one? Would it make sense to have 
on one hand an indexer with plugins supporting deletions and expecting a URL, 
and on the other a separate job for converting segments and crawldb to XML, 
WARC etc.?

Does it make more sense? 




 Pluggable indexing backends
 ---

 Key: NUTCH-1047
 URL: https://issues.apache.org/jira/browse/NUTCH-1047
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Julien Nioche
Assignee: Julien Nioche
  Labels: indexing
 Fix For: 1.5



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2011-12-06 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13163515#comment-13163515
 ] 

Markus Jelsma commented on NUTCH-1047:
--

Ah yes it makes sense now!
 
If you look at the patch for NUTCH-1139 you can see that the endpoint, Solr in 
this case, implements the delete method as called from NutchIndexAction. 
Another endpoint could simply ignore the call and do nothing but write out WARC 
or Solr XML files.
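
A rough sketch of that idea, under stated assumptions: SketchBackend,
ServerBackend and DumpBackend are hypothetical names, an in-memory map stands
in for a remote index, and a StringBuilder stands in for a WARC or XML file.
None of this is taken from the NUTCH-1139 patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical endpoint contract; names are illustrative only.
interface SketchBackend {
  void write(String url, String doc);

  void delete(String url);
}

// Stands in for a remote index (Solr, ES): a deletion removes the document.
class ServerBackend implements SketchBackend {
  final Map<String, String> index = new LinkedHashMap<>();

  @Override
  public void write(String url, String doc) {
    index.put(url, doc);
  }

  @Override
  public void delete(String url) {
    index.remove(url);
  }
}

// Stands in for a dump (WARC, XML): records are appended, deletes are ignored.
class DumpBackend implements SketchBackend {
  final StringBuilder out = new StringBuilder();

  @Override
  public void write(String url, String doc) {
    out.append(url).append('\t').append(doc).append('\n');
  }

  @Override
  public void delete(String url) {
    // Intentionally a no-op: a dump has nothing to remove.
  }
}
```

Both kinds of backend then fit a single endpoint, and the indexing job does
not need to know which kind it is talking to.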





[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2011-12-06 Thread Julien Nioche (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13163704#comment-13163704
 ] 

Julien Nioche commented on NUTCH-1047:
--

The classes NutchIndexWriter and NutchIndexWriterFactory already provide us with 
the type of abstraction we need. We could turn the interface NutchIndexWriter 
into an endpoint and add the methods we need (e.g. delete). What is not clear 
yet is what IndexerOutputFormat is used for and whether we will be able to use 
implementations of NutchIndexWriter from within a plugin.
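
To illustrate the kind of abstraction meant here, a minimal sketch of choosing
a writer implementation by name. It assumes a plain map in place of Nutch's
plugin machinery; SketchWriter and SketchWriterRegistry are hypothetical names
and do not reflect how NutchIndexWriterFactory is actually implemented.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for the writer extension point.
interface SketchWriter {
  String describe();
}

// Hypothetical registry: maps a backend name to a factory for its writer.
class SketchWriterRegistry {
  private final Map<String, Supplier<SketchWriter>> factories = new HashMap<>();

  void register(String name, Supplier<SketchWriter> factory) {
    factories.put(name, factory);
  }

  SketchWriter create(String name) {
    Supplier<SketchWriter> factory = factories.get(name);
    if (factory == null) {
      throw new IllegalArgumentException("No indexing backend registered as: " + name);
    }
    return factory.get();
  }
}
```

The open question in the comment above, whether such implementations can be
loaded from within a plugin, is not addressed by this sketch.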





[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2011-12-05 Thread Julien Nioche (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13162704#comment-13162704
 ] 

Julien Nioche commented on NUTCH-1047:
--

It would be nice to have a plugin implementing this endpoint to generate WARC 
files. There seem to be two different situations though: one where we send 
docs to servers (SOLR, ES) and one where we generate files. Do we need to 
handle deletions for the latter? I don't think so but we would need to for the 
former. 

Any thoughts on this? Would it make sense to have 2 different endpoints or not?





[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2011-12-05 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13162977#comment-13162977
 ] 

Markus Jelsma commented on NUTCH-1047:
--

Hi Julien,

I'm not sure I get your point exactly, but if we don't generate WARC files we:
- don't have to think about the problem you state
- don't create an additional process between Nutch and a search engine

If you need WARC files for some reason, I'd rather have an endpoint for it, 
just like for ES and Solr, instead of using WARC files as an intermediate format.

Does your suggestion imply: segment+crawldb -> warc files -> search engine? 






[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2011-07-16 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13066484#comment-13066484
 ] 

Julien Nioche commented on NUTCH-1047:
--

{quote}
My interest in your last point is a question which I suppose is wide open to 
discussion. What end-points (generally speaking) are we going to support and 
formally represent as pluggable entities? What criteria do we make decisions 
based on?
{quote}

We'll simply port the existing SOLR indexing to the plugin-based architecture 
so that people can easily add the backends they need. If there is a widespread 
need for a specific backend then I suppose someone will contribute patches and 
it might get committed. It's not like we need to define which backends (not 
same as endpoints BTW) would be added etc... we are just giving people the 
possibility of simply adding theirs without having to do a dirty hack of the 
indexer.

There is currently a growing interest in ElasticSearch and I know of at least 
one person who's modified the SOLR indexer to get it to work for ES. This would 
be a good candidate for inclusion, apart from that let's see what people 
contribute.



 

 Pluggable indexing backends
 ---

 Key: NUTCH-1047
 URL: https://issues.apache.org/jira/browse/NUTCH-1047
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.4
Reporter: Julien Nioche
  Labels: indexing
 Fix For: 1.4


 One possible feature would be to add a new endpoint for indexing-backends and 
 make the indexing pluggable. At the moment we are hardwired to SOLR - which is 
 OK - but as other resources like ElasticSearch are becoming more popular it 
 would be better to handle this as plugins. Not sure about the name of the 
 endpoint though : we already have indexing-plugins (which are about 
 generating fields sent to the backends) and moreover the backends are not 
 necessarily for indexing / searching but could be just an external storage 
 e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this 
 could be pertaining to the storage in GORA. 'indexing-backend' is the best 
 name that came to my mind so far - please suggest better ones.
 We should come up with generic map/reduce jobs for indexing, deduplicating 
 and cleaning and maybe add a Nutch extension point there so we can easily 
 hook up indexing, cleaning and deduplicating for various end-points.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira