[ https://issues.apache.org/jira/browse/NUTCH-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
lufeng updated NUTCH-1529:
--------------------------
Attachment: NUTCH-1529-trunk-v3.patch
@Lewis I added the MongoDB dependency to ivy.xml.
@Tejas It writes the URLs and other fields, such as fetchInterval, to standard
output, as DmozParser does.
Example command:
mkdir mongodb
bin/nutch org.apache.nutch.tools.MongodbParser \
  mongodb://192.168.166.62:50124/crawldb -collection urls \
  -fields url,score,fetchInterval \
  -outputFieldNames ,nutch.score,nutch.fetchInterval \
  -query url:apache -queryRegex -sortBy score > mongodb/urls
This connects to the crawldb database and reads the urls collection. The
retrieved fields are url, score and fetchInterval, and their output keys are
"", nutch.score and nutch.fetchInterval respectively (the empty key means the
url value is printed without a key). The query matches the url field against
the regex pattern "apache", and all records are sorted by score.
The output may look like this:
http://apache.com nutch.score=2.0 nutch.fetchInterval=3000
http://tomcat.apache.org nutch.score=1.0 nutch.fetchInterval=10000
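
To make the field-to-key mapping concrete, here is a minimal Python sketch of how such output lines could be built. It is not the code in the patch: format_record, the in-memory records list, the descending sort order and the single-space separator are all assumptions made for illustration, chosen to match the sample output above.

```python
import re

def format_record(record, fields, output_names):
    """Render one record as a line (hypothetical helper, not from the patch).

    A field whose output name is empty is written as a bare value;
    every other field is written as name=value.
    """
    parts = []
    for field, name in zip(fields, output_names):
        value = record[field]
        parts.append(f"{name}={value}" if name else str(value))
    # Separator assumed to be one space, as it appears in the sample output.
    return " ".join(parts)

# Values of -fields and -outputFieldNames from the example command.
fields = "url,score,fetchInterval".split(",")
output_names = ",nutch.score,nutch.fetchInterval".split(",")

# Stand-in for the documents in the urls collection (made-up in-memory data).
records = [
    {"url": "http://tomcat.apache.org", "score": 1.0, "fetchInterval": 10000},
    {"url": "http://apache.com", "score": 2.0, "fetchInterval": 3000},
]

# -query url:apache -queryRegex: keep records whose url matches the regex.
matched = [r for r in records if re.search("apache", r["url"])]

# -sortBy score: descending order is assumed from the sample output.
matched.sort(key=lambda r: r["score"], reverse=True)

lines = [format_record(r, fields, output_names) for r in matched]
print("\n".join(lines))
```

Splitting ",nutch.score,nutch.fetchInterval" on commas yields an empty first name, which is why the url value appears without a key.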
Thanks, Lewis and Tejas.
> Port nutch-mongdb-parser to trunk
> ---------------------------------
>
> Key: NUTCH-1529
> URL: https://issues.apache.org/jira/browse/NUTCH-1529
> Project: Nutch
> Issue Type: Bug
> Components: injector
> Affects Versions: 1.6
> Reporter: Lewis John McGibbney
> Assignee: lufeng
> Priority: Minor
> Fix For: 1.7
>
> Attachments: NUTCH-1529-trunk.patch, NUTCH-1529-trunk-v2.patch,
> NUTCH-1529-trunk-v3.patch
>
>
> The initial repo is here [0]
> [0] https://github.com/ctjmorgan/nutch-mongdb-parser