[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-03-07 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13595806#comment-13595806
 ] 

Hudson commented on NUTCH-1047:
---

Integrated in Nutch-trunk-Windows #57 (See 
[https://builds.apache.org/job/Nutch-trunk-Windows/57/])
NUTCH-1047 Pluggable indexing backends (Revision 1453776)

 Result = FAILURE
jnioche : http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1453776
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/build.xml
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/default.properties
* /nutch/trunk/ivy/ivy.xml
* /nutch/trunk/src/bin/nutch
* /nutch/trunk/src/java/org/apache/nutch/crawl/Crawl.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/CleaningJob.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexWriter.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexWriters.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingJob.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/NutchIndexWriter.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/NutchIndexWriterFactory.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrClean.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrMappingReader.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrUtils.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrWriter.java
* /nutch/trunk/src/plugin/build.xml
* /nutch/trunk/src/plugin/indexer-solr
* /nutch/trunk/src/plugin/indexer-solr/build.xml
* /nutch/trunk/src/plugin/indexer-solr/ivy.xml
* /nutch/trunk/src/plugin/indexer-solr/plugin.xml
* /nutch/trunk/src/plugin/indexer-solr/src
* /nutch/trunk/src/plugin/indexer-solr/src/java
* /nutch/trunk/src/plugin/indexer-solr/src/java/org
* /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache
* /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch
* /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter
* /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr
* /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrConstants.java
* /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java
* /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrMappingReader.java
* /nutch/trunk/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrUtils.java
* /nutch/trunk/src/plugin/nutch-extensionpoints/plugin.xml


 Pluggable indexing backends
 ---

 Key: NUTCH-1047
 URL: https://issues.apache.org/jira/browse/NUTCH-1047
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Julien Nioche
Assignee: Julien Nioche
  Labels: indexing
 Fix For: 1.7

 Attachments: NUTCH-1047-1.x-final.patch, NUTCH-1047-1.x-v1.patch, 
 NUTCH-1047-1.x-v2.patch, NUTCH-1047-1.x-v3.patch, NUTCH-1047-1.x-v4.patch, 
 NUTCH-1047-1.x-v5.patch


 One possible feature would be to add a new endpoint for indexing-backends and 
 make the indexing pluggable. At the moment we are hardwired to SOLR - which is 
 OK - but as other resources like ElasticSearch are becoming more popular it 
 would be better to handle this as plugins. Not sure about the name of the 
 endpoint though : we already have indexing-plugins (which are about 
 generating fields sent to the backends) and moreover the backends are not 
 necessarily for indexing / searching but could be just an external storage 
 e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this 
 could be pertaining to the storage in GORA. 'indexing-backend' is the best 
 name that came to my mind so far - please suggest better ones.
 We should come up with generic map/reduce jobs for indexing, deduplicating 
 and cleaning and maybe add a Nutch extension point there so we can easily 
 hook up indexing, cleaning and deduplicating for various backends.
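
 The extension point proposed above could look roughly like the following. This is only a minimal sketch, not the interface from the patch: BackendWriter, InMemoryWriter and BackendWriters are hypothetical names invented for illustration, and a document is reduced to a map of string fields.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical contract that every indexing backend would implement. */
interface BackendWriter {
    void write(String id, Map<String, String> fields);
    void delete(String id);
    void commit();
}

/** Toy backend that keeps documents in memory; a stand-in for a real backend plugin. */
class InMemoryWriter implements BackendWriter {
    final Map<String, Map<String, String>> docs = new LinkedHashMap<>();
    int commits = 0;

    public void write(String id, Map<String, String> fields) { docs.put(id, fields); }
    public void delete(String id) { docs.remove(id); }
    public void commit() { commits++; }
}

/** Fans every call out to all registered backends, so the job code stays backend-neutral. */
class BackendWriters implements BackendWriter {
    private final List<BackendWriter> writers = new ArrayList<>();

    BackendWriters register(BackendWriter writer) {
        writers.add(writer);
        return this;
    }

    public void write(String id, Map<String, String> fields) {
        for (BackendWriter w : writers) w.write(id, fields);
    }

    public void delete(String id) {
        for (BackendWriter w : writers) w.delete(id);
    }

    public void commit() {
        for (BackendWriter w : writers) w.commit();
    }
}
```

 With this shape, the generic indexing and cleaning jobs would only ever talk to the fan-out class and would not need to know which backends are plugged in.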

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-03-07 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13595809#comment-13595809
 ] 

Hudson commented on NUTCH-1047:
---

Integrated in Nutch-trunk #2144 (See 
[https://builds.apache.org/job/Nutch-trunk/2144/])
NUTCH-1047 Pluggable indexing backends (Revision 1453776)

 Result = SUCCESS
jnioche : http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1453776

[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-03-07 Thread Lewis John McGibbney (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13596042#comment-13596042
 ] 

Lewis John McGibbney commented on NUTCH-1047:
-

Nice work, Julien.


[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-03-06 Thread Sebastian Nagel (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13595252#comment-13595252
]

Sebastian Nagel commented on NUTCH-1047:

Hi Julien,

Overall, everything looks good. A first version of the CSV indexer is ready
(NUTCH-1541) and works well with the latest v5 patch.

One point we should improve is the command-line help. I agree with Tejas that
the help should list all required arguments. Of course, you are right that the
index/cleaning jobs are backend-neutral, but in that case it would be
preferable to have new commands index and indexclean. They are also required
if other indexer back-ends are used. We can keep the solr* commands for legacy
reasons and because they are handy. A few additional lines to generate the
previous help text are tolerable and could avoid unnecessary user requests on
the mailing list.

The describe() method is a good idea. The new commands will then show
sufficient help, but IndexingJob/CleaningJob should also call describe() when
help is shown!
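
One way the jobs could include describe() output in their help is sketched below. This is a hypothetical illustration, not code from the patch: DescribableWriter and UsageHelp are invented names, and the text returned by describe() is left to whatever each plugin chooses to report.

```java
import java.util.List;

/** Hypothetical view of an index writer plugin that can describe itself. */
interface DescribableWriter {
    String describe();
}

/** Builds a help text made of the generic usage line plus one line per active writer. */
class UsageHelp {
    static String usage(String command, String arguments, List<DescribableWriter> activeWriters) {
        StringBuilder sb = new StringBuilder();
        sb.append("Usage: ").append(command).append(' ').append(arguments).append('\n');
        if (activeWriters.isEmpty()) {
            sb.append("No index writers are active.\n");
        } else {
            sb.append("Active index writers:\n");
            for (DescribableWriter writer : activeWriters) {
                sb.append("  ").append(writer.describe()).append('\n');
            }
        }
        return sb.toString();
    }
}
```

The generic usage line stays backend-neutral, while backend-specific requirements such as solr.server.url would surface through the plugin's own description.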

Some trivialities to get the Javadocs right:
* default.properties - the new plugins.indexer group needs to be added, with
indexer-solr as a member
* build.xml - add a group referring to plugins.indexer and add Javadoc targets
for indexer-solr


[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-02-21 Thread Tejas Patil (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13583011#comment-13583011
 ] 

Tejas Patil commented on NUTCH-1047:


Hi Julien,

One small change in the Java class would be to display this usage message to the 
user:
{noformat}$ bin/nutch solrclean 
Usage: CleaningJob crawldb solrurl [-noCommit]{noformat}

The current patch doesn't display solrurl in the usage.


[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-02-21 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13583111#comment-13583111
 ] 

Julien Nioche commented on NUTCH-1047:
--

Tejas,

The CleaningJob is backend-neutral and as such should not expect solrurl as a 
parameter. The same goes for the IndexingJob.


[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-02-20 Thread Julien Nioche (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13582034#comment-13582034
]

Julien Nioche commented on NUTCH-1047:
--

Hi Tejas

Thank you for taking the time to have a look. The SolrClean command has also
been modified to use the plugin architecture, and that should be the last
thing, I think.

Thanks

Julien


[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-02-20 Thread Tejas Patil (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13582163#comment-13582163
]

Tejas Patil commented on NUTCH-1047:

Hey Julien,

While running the solrclean command, I followed the old usage given here [0].
It threw an exception. Then I checked the usage, which printed:
{noformat}$ bin/nutch solrclean
Usage: CleaningJob crawldb [-noCommit]{noformat}

That did not work either: it just prints the usage if only the crawldb is
passed as an argument. I went through the patch and realized that the bin/nutch
script treats the first argument as the Solr URL and then passes the rest, i.e.
the crawldb, to the Java code. This is what worked for me:
{noformat}bin/nutch solrclean solrurl crawldb{noformat}

This is different from the old usage given at [0]. We can avoid changing the
order of the arguments and preserve the old usage. The following can be used in
the bin/nutch script:
{noformat}CLASS=org.apache.nutch.indexer.CleaningJob -D solr.server.url=$2{noformat}
without performing a shift after that. The corresponding usage must be modified
in the Java code too.

[0] : http://wiki.apache.org/nutch/bin/nutch%20solrclean


[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-02-20 Thread Julien Nioche (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13582183#comment-13582183
]

Julien Nioche commented on NUTCH-1047:
--

Hi Tejas

Good catch, we could do:

{color:red}
CLASS=org.apache.nutch.indexer.CleaningJob -D solr.server.url=$2 $1
shift; shift
{color}

No change is needed in the Java code, as it expects only one argument, which
is the crawldb. We could also get the CleaningJob to log which indexers are
available.
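
The argument handling discussed here can be modelled as follows. This is a sketch in Java purely for illustration; the actual change would live in the bin/nutch shell script. LegacySolrClean and translate are hypothetical names. The sketch takes the legacy order (crawldb first, then the Solr URL), turns the URL into a -D solr.server.url=... definition, and leaves the crawldb as the only positional argument for the backend-neutral job.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of how the legacy solrclean arguments map onto the generic job. */
class LegacySolrClean {
    /**
     * Input:  crawldb solrurl [other options...]
     * Output: -D solr.server.url=solrurl crawldb [other options...]
     */
    static String[] translate(String[] legacyArgs) {
        if (legacyArgs.length < 2) {
            throw new IllegalArgumentException("Usage: solrclean crawldb solrurl [-noCommit]");
        }
        List<String> out = new ArrayList<>();
        out.add("-D");
        out.add("solr.server.url=" + legacyArgs[1]); // second legacy argument is the Solr URL
        out.add(legacyArgs[0]);                      // first legacy argument is the crawldb
        for (int i = 2; i < legacyArgs.length; i++) {
            out.add(legacyArgs[i]);                  // pass remaining options through unchanged
        }
        return out.toArray(new String[0]);
    }
}
```

Skipping both consumed arguments before appending the rest plays the role of the two shift calls in the script.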

Re solrdedup: the explanation is given earlier in this thread. It is a
SOLR-specific approach and we can't run a job located in a plugin. The main job
file has to be in the core code. We need a better deduplicator anyway.


[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-02-19 Thread lufeng (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13581910#comment-13581910
]

lufeng commented on NUTCH-1047:
---

Patch v5 works correctly in Nutch 1.6 with Solr 3.6 and 4.1. For Solr 4.1, the
configuration file schema-solr4.xml was used with the patch from
[NUTCH-1486|https://issues.apache.org/jira/browse/NUTCH-1486] applied.

It would be better if the index job could report progress.

Good job, thanks Julien.


[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-02-19 Thread Tejas Patil (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13581964#comment-13581964
 ] 

Tejas Patil commented on NUTCH-1047:


Hi Julien,

The crawl command (with the solr option) and the solrindex command are working 
properly now :) Is there anything else that you think must be verified?


[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-29 Thread Tejas Patil (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13565202#comment-13565202
]

Tejas Patil commented on NUTCH-1047:

Hi Julien,

As you suggested, I tried to run the solrindex command without setting
solr.server.url in nutch-site.xml or with -D.

Command used: {noformat}bin/nutch solrindex http://localhost:8983/solr mycrawl/crawldb/ mycrawl/segments/201301280439/{noformat}

It says: {noformat}
Usage: Indexer crawldb [-linkdb linkdb] [-params k1=v1&k2=v2...] (segment ... | -dir segments) [-noCommit] [-deleteGone] [-filter] [-normalize]{noformat}

The check on the number of arguments is causing this. I corrected it locally
and it worked fine after that.
According to the usage above, the user needs to provide just the crawldb and a
segment. But the user also needs to pass the solrurl, which is consumed by the
bin/nutch script. The usage message must be changed to hide this mechanism from
the user.


[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-28 Thread Julien Nioche (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564173#comment-13564173
]

Julien Nioche commented on NUTCH-1047:
--

@tejasp I can reproduce the issue and am looking into it, thanks. Somehow the
configuration does not get passed on properly when using the crawl command.

Lufeng
{quote}
But i don't know why not add an option to set IndexerUrl such as bin/nutch
solrindex -indexurl http://localhost:8983/solr/.
{quote}

Whether it is passed as a parameter or via configuration should not make much
of a difference. Your suggestion also assumes that the indexing backend can be
reached via a single URL, which is not necessarily the case: it might not need
a URL at all or, conversely, might need multiple URLs. Better to leave that
logic in the configuration and assume that the backends will find whatever they
need there.
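
To illustrate the point, here is a minimal sketch of backends that each pull what they need from the configuration. Everything in it is hypothetical: the ConfiguredBackend interface, the three classes, and the example.cluster.hosts key are invented for illustration, and the configuration is reduced to a plain map. Only solr.server.url is a property name taken from this thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical contract: each backend works out its own endpoints from the configuration. */
interface ConfiguredBackend {
    List<String> endpoints(Map<String, String> conf);
}

/** Backend reachable through exactly one URL, read from solr.server.url. */
class SingleUrlBackend implements ConfiguredBackend {
    public List<String> endpoints(Map<String, String> conf) {
        String url = conf.get("solr.server.url");
        if (url == null) {
            throw new IllegalStateException("Missing solr.server.url");
        }
        return Collections.singletonList(url);
    }
}

/** Backend that needs several hosts; "example.cluster.hosts" is an invented key. */
class MultiHostBackend implements ConfiguredBackend {
    public List<String> endpoints(Map<String, String> conf) {
        String hosts = conf.get("example.cluster.hosts");
        if (hosts == null || hosts.trim().isEmpty()) {
            return Collections.emptyList();
        }
        List<String> out = new ArrayList<>();
        for (String host : hosts.split(",")) {
            if (!host.trim().isEmpty()) {
                out.add(host.trim());
            }
        }
        return out;
    }
}

/** Backend that writes to local storage and therefore needs no URL at all. */
class LocalFileBackend implements ConfiguredBackend {
    public List<String> endpoints(Map<String, String> conf) {
        return Collections.emptyList();
    }
}
```

A single -indexurl option could serve only the first of these three cases, which is the argument for keeping the lookup inside each backend.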

{quote}
the correct command to invoke the IndexingJob command is bin/nutch solrindex
http://localhost:8983/solr/ crawldb/ segments/20130121115214/ -filter.
{quote}

As explained above, we want to keep compatibility with the existing solrindex
command and not change its syntax. Underneath it uses the new code based on
plugins but sets the value of the solr config. There is no shortcut for the
generic indexing job command in the nutch script yet, but we could add one. For
now it has to be called in full, e.g. bin/nutch
org.apache.nutch.indexer.IndexingJob ..., which will make sense when we have
other indexing backends and not just SOLR.

Think of 'nutch solrindex' as a shortcut for the generic command.


[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-28 Thread Tejas Patil (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564187#comment-13564187
]

Tejas Patil commented on NUTCH-1047:

Hi Julien,
After the reply from @lufeng, I was able to perform indexing with the crawl
command. Here is a summary of what I have observed:
||solr.server.url in nutch-site.xml||-D in crawl command||Works ?||
|no|no|RuntimeException: Missing SOLR URL|
|no|yes|yes|
|yes|no|yes|
|yes|yes|yes|

Note that I had to pass -solr and the Solr URL every time. Otherwise it didn't
invoke indexing.
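
The four rows of the table can be reproduced with a small lookup model. This is a sketch only: it assumes that a value given with -D takes precedence over nutch-site.xml, which the table itself does not show, and SolrUrlResolver with its two maps is a hypothetical stand-in for the real configuration object.

```java
import java.util.Map;

/** Hypothetical model of where the Solr URL can come from. */
class SolrUrlResolver {
    static final String KEY = "solr.server.url";

    /**
     * Looks at the -D definitions first (assumed to win), then at the site
     * configuration, and fails if neither provides a value.
     */
    static String resolve(Map<String, String> siteConf, Map<String, String> commandLineDefs) {
        String url = commandLineDefs.get(KEY);
        if (url == null) {
            url = siteConf.get(KEY);
        }
        if (url == null) {
            throw new RuntimeException("Missing SOLR URL");
        }
        return url;
    }
}
```

Only the combination with no value in either place fails, matching the first row of the table.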


[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-28 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564196#comment-13564196
 ] 

Julien Nioche commented on NUTCH-1047:
--

Hi Tejas

It will work every time you set it in nutch-site.xml. As for setting it with -D 
in the crawl command, you definitely should not have to do that, and this is 
where the bug is. The problem is that the value we take from the crawl command 
is correctly set in the configuration object, but for some reason the latter is 
reloaded or overridden during the call to JobClient.runJob(job) (IndexingJob 
line 120).

BTW the crawl command is deprecated and should be removed at some point as we 
have the crawl script. Could you try using the solrindex command as well as the 
crawl script while I try and solve the problem with the crawl command?

Thanks

Julien




[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-28 Thread Tejas Patil (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564252#comment-13564252
]

Tejas Patil commented on NUTCH-1047:

Hi Julien, the solrindex command and the crawl script work fine after setting
solr.server.url in nutch-site.xml. I did not use the -D option during these
runs.


[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-28 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564263#comment-13564263
 ] 

Julien Nioche commented on NUTCH-1047:
--

Tejas

The crawl script and the solrindex command should work without setting 
solr.server.url in nutch-site.xml or using -D, as this is handled for you in 
the nutch script. Can you please test without specifying solr.server.url in 
nutch-site.xml?

Thanks

 Pluggable indexing backends
 ---

 Key: NUTCH-1047
 URL: https://issues.apache.org/jira/browse/NUTCH-1047
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Julien Nioche
Assignee: Julien Nioche
  Labels: indexing
 Fix For: 1.7

 Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch, 
 NUTCH-1047-1.x-v3.patch, NUTCH-1047-1.x-v4.patch


 One possible feature would be to add a new endpoint for indexing-backends and 
 make the indexing plugable. at the moment we are hardwired to SOLR - which is 
 OK - but as other resources like ElasticSearch are becoming more popular it 
 would be better to handle this as plugins. Not sure about the name of the 
 endpoint though : we already have indexing-plugins (which are about 
 generating fields sent to the backends) and moreover the backends are not 
 necessarily for indexing / searching but could be just an external storage 
 e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this 
 could be pertaining to the storage in GORA. 'indexing-backend' is the best 
 name that came to my mind so far - please suggest better ones.
 We should come up with generic map/reduce jobs for indexing, deduplicating 
 and cleaning and maybe add a Nutch extension point there so we can easily 
 hook up indexing, cleaning and deduplicating for various backends.


[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-28 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13564827#comment-13564827
 ] 

Sebastian Nagel commented on NUTCH-1047:


As a test of the interface, I started to implement a CSV indexer, which is useful for 
exporting crawled data or for quick analysis. A first working version (a draft, 
still a lot to do) fits in 100+ lines of code: +1 for the interface / extension 
point.

Some concerns about the usability of IndexingJob as a daily tool:
- it's not really transparent which indexer is run (solr, elastic, etc.): you 
have to look into the property plugin.includes
- options must be passed to indexer plugins as properties: this is complicated, and 
there is no help for listing the available properties
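
The first concern can be illustrated with a small, self-contained sketch. It assumes that a plugin is active only when its whole id matches the regular expression in the plugin.includes property. The class name, the helper method, the sample property value, and the ids indexer-elastic and indexer-csv are hypothetical and used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class ActiveIndexers {

    // Returns the indexer ids from `known` that a plugin.includes-style regex selects.
    // Assumption: an id is active only if the whole id matches the expression.
    static List<String> activeIndexers(String pluginIncludes, List<String> known) {
        Pattern includes = Pattern.compile(pluginIncludes);
        List<String> active = new ArrayList<>();
        for (String id : known) {
            if (id.startsWith("indexer-") && includes.matcher(id).matches()) {
                active.add(id);
            }
        }
        return active;
    }

    public static void main(String[] args) {
        // Sample value only; a real configuration will list other plugins.
        String pluginIncludes =
            "protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr";
        List<String> known = Arrays.asList("indexer-solr", "indexer-elastic", "indexer-csv");
        System.out.println(activeIndexers(pluginIncludes, known)); // prints [indexer-solr]
    }
}
```

A helper along these lines could let a job report which backends it is about to write to, instead of leaving the user to read the regular expression by hand.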




[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-27 Thread lufeng (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13564076#comment-13564076
]

lufeng commented on NUTCH-1047:
---

Hi Tejas

Maybe you did not add the -D option to the bin/nutch crawl command. These are all
used to set the solr.server.url parameter. The unknown field content error may
be caused by an incorrectly configured Solr schema.xml. Did you copy
conf/schema.xml from the Nutch conf directory to the example/solr/conf directory?

Pluggable indexing backends
---

Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch,
NUTCH-1047-1.x-v3.patch, NUTCH-1047-1.x-v4.patch

[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-27 Thread lufeng (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13564089#comment-13564089
 ] 

lufeng commented on NUTCH-1047:
---

Hi Julien,

I found that bin/nutch contains a line like this: 
CLASS=org.apache.nutch.indexer.IndexingJob -D solr.server.url=$1 , but I 
don't know why there is no option to set the indexer URL, such as bin/nutch solrindex 
-indexurl http://localhost:8983/solr/. 

But now I have found that the correct command to invoke the IndexingJob is 
bin/nutch solrindex http://localhost:8983/solr/ crawldb/ 
segments/20130121115214/ -filter. :(


[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-27 Thread Tejas Patil (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13564095#comment-13564095
]

Tejas Patil commented on NUTCH-1047:

Hi Lufeng,

You are right. There was a problem with my schema.xml file. I corrected it and
now things are working. Thanks !!

Pluggable indexing backends
---

Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch,
NUTCH-1047-1.x-v3.patch, NUTCH-1047-1.x-v4.patch

[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-25 Thread Julien Nioche (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13562558#comment-13562558
]

Julien Nioche commented on NUTCH-1047:
--

Hi Lufeng.

The solrindex command in the nutch script works just as before. You can also
invoke the IndexingJob command and pass it the SOLR URL as a Hadoop parameter
e.g. {{-D solr.server.url=xx}}
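
How such a parameter can be separated from the job's own arguments is sketched below without Hadoop itself. This is a minimal stand-in, assuming that `-D` and `key=value` arrive as two separate arguments; the class and method names are hypothetical, and this is not Hadoop's actual option parser. The sample arguments reuse the invocation quoted elsewhere in this thread.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GenericOptionsSketch {
    // Properties collected from "-D key=value" pairs.
    final Map<String, String> properties = new LinkedHashMap<>();
    // Everything else, left for the job itself to interpret.
    final List<String> remaining = new ArrayList<>();

    static GenericOptionsSketch parse(String[] args) {
        GenericOptionsSketch parsed = new GenericOptionsSketch();
        for (int i = 0; i < args.length; i++) {
            boolean isProperty = "-D".equals(args[i])
                && i + 1 < args.length
                && args[i + 1].indexOf('=') > 0;
            if (isProperty) {
                String pair = args[++i]; // consume the key=value argument
                int eq = pair.indexOf('=');
                parsed.properties.put(pair.substring(0, eq), pair.substring(eq + 1));
            } else {
                parsed.remaining.add(args[i]);
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        GenericOptionsSketch parsed = parse(new String[] {
            "-D", "solr.server.url=http://localhost:8983/solr/",
            "crawldb/", "segments/20130121115214/", "-filter"});
        System.out.println(parsed.properties.get("solr.server.url"));
        System.out.println(parsed.remaining);
    }
}
```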

SolrUtils is duplicated indeed because of DeleteDuplicates, which is a
SOLR-specific implementation. We need to build a generic deduplicator at some
point and it will use the pluggable backends. I decided to leave the SOLR-based
one in for now, but if most people don't use it then we should probably shelve
it. This is a separate issue though.

Thanks for your comments

Pluggable indexing backends
---

Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch,
NUTCH-1047-1.x-v3.patch, NUTCH-1047-1.x-v4.patch

[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-24 Thread lufeng (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13562369#comment-13562369
 ] 

lufeng commented on NUTCH-1047:
---

Hi, I applied the patch, but I could not find how to set the Solr URI, and the class 
SolrUtils is duplicated in two places. Maybe later DeleteDuplicates will 
be pluggable in the backends too.


[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-18 Thread Markus Jelsma (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13557322#comment-13557322
]

Markus Jelsma commented on NUTCH-1047:
--

Excellent work my friend! I'll be sure to test this next week! Hopefully it all
works out fine and i can rewrite the other indexing patches with ease.

Cheers!

Pluggable indexing backends
---

Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch,
NUTCH-1047-1.x-v3.patch, NUTCH-1047-1.x-v4.patch

[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-17 Thread Julien Nioche (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13556026#comment-13556026
]

Julien Nioche commented on NUTCH-1047:
--

Good point Markus, thanks.
The main issue I am struggling with at the moment is what to do with the SOLR
deduplication. I don't think we can run a MapReduce job from a plugin so it's
not going to work. One (temporary) option would be to leave it as is, so that
the crawl command, the crawl script and the nutch command all work as expected,
and then get rid of it when we have a generic deduplication job.

Pluggable indexing backends
---

Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch,
NUTCH-1047-1.x-v3.patch

[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-17 Thread Markus Jelsma (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13556039#comment-13556039
]

Markus Jelsma commented on NUTCH-1047:
--

I had an issue with dedup too in NUTCH-1480; unless we do something about it I
cannot commit that. Personally I'd prefer to never touch that class again but
keep it as legacy. What do you think?

Pluggable indexing backends
---

Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch,
NUTCH-1047-1.x-v3.patch

[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-17 Thread Julien Nioche (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13556041#comment-13556041
]

Julien Nioche commented on NUTCH-1047:
--

We definitely need a better mechanism for deduplication. +1 to leave as is for
now until we have a better option. What is slightly annoying for this issue is
that it means adding it back to the main classes, along with SOLR as a
dependency; not a big deal though.

Pluggable indexing backends
---

Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch,
NUTCH-1047-1.x-v3.patch

[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-17 Thread Markus Jelsma (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13556052#comment-13556052
]

Markus Jelsma commented on NUTCH-1047:
--

Alright, i'll skip dedup for NUTCH-1480 and see if i can send it in and work on
NUTCH-1377.

Are you sure you cannot run a MapReduce program from within a plugin? I think
it's worth trying :)

Pluggable indexing backends
---

Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch,
NUTCH-1047-1.x-v3.patch

[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-17 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13556054#comment-13556054
 ] 

Julien Nioche commented on NUTCH-1047:
--

Tried, failed. 
Re- other issues : wouldn't it make sense to do NUTCH-1047 first before you 
improve the SOLR-backends?

 Pluggable indexing backends
 ---

 Key: NUTCH-1047
 URL: https://issues.apache.org/jira/browse/NUTCH-1047
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Julien Nioche
Assignee: Julien Nioche
  Labels: indexing
 Fix For: 1.7

 Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch, 
 NUTCH-1047-1.x-v3.patch



[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-17 Thread Markus Jelsma (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13556075#comment-13556075
]

Markus Jelsma commented on NUTCH-1047:
--

too bad.

I'm not sure, at least 1480 is ready but fine by me. too bad i'll have to
rewrite the patches then ;)

Pluggable indexing backends
---

Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch,
NUTCH-1047-1.x-v3.patch

[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-17 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13556079#comment-13556079
 ] 

Julien Nioche commented on NUTCH-1047:
--

Should not be a big deal as the classes affected by NUTCH-1480 are not modified 
that much by NUTCH-1047 and it also means that you'll get to look at the code 
for this issue which is a good way of reviewing it :-)


[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-17 Thread Markus Jelsma (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13556084#comment-13556084
]

Markus Jelsma commented on NUTCH-1047:
--

{quote}which is a good way of reviewing it{quote}

Cheers! Looking forward to your new patch.

Pluggable indexing backends
---

Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch,
NUTCH-1047-1.x-v3.patch

[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-17 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13556091#comment-13556091
 ] 

Julien Nioche commented on NUTCH-1047:
--

my suggestion was that you give NUTCH-1047 a try, wait until it is committed 
then commit your changes to it, not that I'd patch it to include your changes.

BTW have commented on NUTCH-1480

thanks

Julien



 


[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-17 Thread Markus Jelsma (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13556096#comment-13556096
 ] 

Markus Jelsma commented on NUTCH-1047:
--

no, i understood correctly :)



[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-16 Thread Markus Jelsma (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1376#comment-1376
]

Markus Jelsma commented on NUTCH-1047:
--

Very nice Julien! Can you also add update() to the writer interface? See
NUTCH-1506. Some impls can do this, such as Solr as of recent commits. Other impls
can defer to add() if applicable or throw an UnsupportedOperationException.
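
One way to picture this suggestion is sketched below. The interface and its signatures are hypothetical and are not taken from the patch; the sketch also relies on Java 8 default methods, which is an assumption about the language level rather than something discussed here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

class UpdateDemo {
    public static void main(String[] args) {
        Map<String, String> doc = Collections.singletonMap("id", "http://example.org/");

        ReAddWriter reAdd = new ReAddWriter();
        reAdd.update(doc); // handled as a plain add()
        System.out.println(reAdd.added.size());

        try {
            new AppendOnlyWriter().update(doc);
        } catch (UnsupportedOperationException e) {
            System.out.println("update rejected: " + e.getMessage());
        }
    }
}

// Hypothetical writer interface; not the interface from the patch.
interface DocWriter {
    void add(Map<String, String> doc);

    // Backends without update support inherit this and reject the call.
    default void update(Map<String, String> doc) {
        throw new UnsupportedOperationException("update not supported");
    }
}

// A backend for which an update is just a re-add of the whole document.
class ReAddWriter implements DocWriter {
    final List<Map<String, String>> added = new ArrayList<>();

    @Override public void add(Map<String, String> doc) { added.add(doc); }

    @Override public void update(Map<String, String> doc) { add(doc); }
}

// A backend that supports add() only and keeps the default update().
class AppendOnlyWriter implements DocWriter {
    final List<Map<String, String>> added = new ArrayList<>();

    @Override public void add(Map<String, String> doc) { added.add(doc); }
}
```

Putting the rejecting behaviour in the interface keeps each backend explicit about whether it opts in to updates.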

Pluggable indexing backends
---

Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch,
NUTCH-1047-1.x-v3.patch

[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2011-12-06 Thread Julien Nioche (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13163481#comment-13163481
]

Julien Nioche commented on NUTCH-1047:
--

bq. If you'd need WARC files, for some reason, i'd rather have an endpoint for
it just like for ES and Solr instead of using WARC files as an intermediate
format.
bq. Does your suggestion imply: segment+crawldb -> warc files -> search engine?

Nope, let's start again :-)
We mentioned in this issue that we'd like to make the indexing backends
pluggable in order to simplify the code and make it easier for others to
implement alternative backends. We currently have only SOLR, ES is clearly a
good candidate and you've rightly pointed out that we could have an XML dump of
the docs. I would add that we could plug in JDBC or HBase etc... WARC is just
another example of something we could have as a plugin.

The question was : is there a functional difference between say [XML|WARC] and
[SOLR|ES]? For instance the plugin endpoint for SOLR|ES would need to handle
deletions, not the XML or WARC one. Are there any more such differences? Is
it an index vs dump issue? A remote vs local one? Would it make sense to have
on one hand an indexer with plugins supporting deletions and expecting a URL
and on the other a separate job for converting segments and crawldb to XML,
WARC etc...?

Does it make more sense?

Pluggable indexing backends
---

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2011-12-06 Thread Markus Jelsma (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13163515#comment-13163515
 ] 

Markus Jelsma commented on NUTCH-1047:
--

Ah yes it makes sense now!
 
If you look at the patch for NUTCH-1139 you can see that the endpoint, Solr in 
this case, implements the delete method as called from NutchIndexAction. 
Another endpoint could simply ignore and do nothing but write out WARC or Solr 
XML files.
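
A sketch of that idea follows, with hypothetical names that are not from NUTCH-1139 or the Nutch code base: one endpoint interface, a server-like backend that applies deletions, and a dump-like backend that ignores them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DeleteDemo {
    public static void main(String[] args) {
        InMemoryIndexWriter index = new InMemoryIndexWriter();
        DumpWriter dump = new DumpWriter();
        // The caller treats both backends the same way.
        for (BackendWriter writer : new BackendWriter[] {index, dump}) {
            writer.write("http://example.org/a", "first page");
            writer.write("http://example.org/b", "second page");
            writer.delete("http://example.org/a");
        }
        System.out.println(index.index.keySet()); // only the second URL remains
        System.out.print(dump.out);               // both documents were dumped
    }
}

// Hypothetical endpoint; the names are illustrative only.
interface BackendWriter {
    void write(String id, String content);
    void delete(String id);
}

// Stands in for a remote index (e.g. Solr or ES): deletions must be applied.
class InMemoryIndexWriter implements BackendWriter {
    final Map<String, String> index = new LinkedHashMap<>();

    @Override public void write(String id, String content) { index.put(id, content); }

    @Override public void delete(String id) { index.remove(id); }
}

// Stands in for a file dump (e.g. XML or WARC): deletions are ignored.
class DumpWriter implements BackendWriter {
    final StringBuilder out = new StringBuilder();

    @Override public void write(String id, String content) {
        out.append(id).append('\t').append(content).append('\n');
    }

    @Override public void delete(String id) {
        // Nothing to do: a dump only records the documents that were written.
    }
}
```

With this shape a single endpoint would be enough, since the difference between the two situations lives inside each implementation.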

 Pluggable indexing backends
 ---

 Key: NUTCH-1047
 URL: https://issues.apache.org/jira/browse/NUTCH-1047
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Julien Nioche
Assignee: Julien Nioche
  Labels: indexing
 Fix For: 1.5



[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2011-12-06 Thread Julien Nioche (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13163704#comment-13163704
]

Julien Nioche commented on NUTCH-1047:
--

The classes NutchIndexWriter and NutchIndexWriterFactory already provide us with
the type of abstraction we need. We could turn the interface NutchIndexWriter
into an endpoint and add the methods we need (e.g. delete). What is not clear
yet is what IndexerOutputFormat is used for and whether we will be able to use
implementations of NutchIndexWriter from within a plugin.

Pluggable indexing backends
---

[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2011-12-05 Thread Julien Nioche (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13162704#comment-13162704
]

Julien Nioche commented on NUTCH-1047:
--

It would be nice to have a plugin implementing this endpoint to generate WARC
files. There seem to be two different situations though : one where we send
docs to servers (SOLR, ES) and one where we generate files. Do we need to
handle deletions for the latter? I don't think so but we would need to for the
former.

Any thoughts on this? Would it make sense to have 2 different endpoints or not?

Pluggable indexing backends
---

[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2011-12-05 Thread Markus Jelsma (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13162977#comment-13162977
 ] 

Markus Jelsma commented on NUTCH-1047:
--

Hi Julien,

I'm not sure i get your point exactly but if we don't generate WARC files we:
- don't have to think about the problem you state
- don't create an additional process between Nutch and a search engine

If you'd need WARC files, for some reason, i'd rather have an endpoint for it 
just like for ES and Solr instead of using WARC files as an intermediate format.

Does your suggestion imply: segment+crawldb -> warc files -> search engine? 



[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2011-07-16 Thread Julien Nioche (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13066484#comment-13066484
]

Julien Nioche commented on NUTCH-1047:
--

{quote}
My interest in your last point is a question which I suppose is wide open to
discussion. What end-points (generally speaking) are we going to support and
formally represent as pluggable entities? What criteria do we make decisions
based on?
{quote}

We'll simply port the existing SOLR indexing to the plugin-based architecture
so that people can easily add the backends they need. If there is a widespread
need for a specific backend then I suppose someone will contribute patches and
it might get committed. It's not like we need to define which backends (not
same as endpoints BTW) would be added etc... we are just giving people the
possibility of simply adding theirs without having to do a dirty hack of the
indexer.

There is currently a growing interest for ElasticSearch and I know of at least
one person who's modified the SOLR indexer to get it to work for ES. This would
be a good candidate for inclusion, apart from that let's see what people
contribute.

Pluggable indexing backends
---

Key: NUTCH-1047
URL: https://issues.apache.org/jira/browse/NUTCH-1047
Project: Nutch
Issue Type: Improvement
Components: indexer
Affects Versions: 1.4
Reporter: Julien Nioche
Labels: indexing
Fix For: 1.4

