[
https://issues.apache.org/jira/browse/NUTCH-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178620#comment-17178620
]
Hudson commented on NUTCH-2496:
-------------------------------
SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #3 (See
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/3/])
NUTCH-2496 Speed up link inversion step in crawling script (snagel:
[https://github.com/apache/nutch/commit/ea6b2f08024fe98ffc62269fdb6f6c700b8f177e])
* (edit) src/bin/crawl
> Speed up link inversion step in crawling script
> -----------------------------------------------
>
> Key: NUTCH-2496
> URL: https://issues.apache.org/jira/browse/NUTCH-2496
> Project: Nutch
> Issue Type: Improvement
> Components: linkdb
> Affects Versions: 1.15
> Reporter: Moreno Feltscher
> Assignee: Lewis John McGibbney
> Priority: Major
> Fix For: 1.17
>
>
> While working on a project where I have to index a huge number of URLs, I
> encountered an issue with the link inversion step of the crawling script. A
> while ago, Ian Lopata ran into the same issue, as described here:
> http://lucene.472066.n3.nabble.com/InvertLinks-Performance-Nutch-1-6-td4183004.html
> {quote}
> I am running the invertlinks step in my Nutch 1.6 based crawl process on a
> single node. I run invertlinks only because I need the Inlinks in the
> indexer step so as to store them with the document. I do not need the
> anchor text and I am not scoring. I am finding that invertlinks (and more
> specifically the merge of the linkdb) takes a long time - about 30 minutes
> for a crawl of around 150K documents. I am looking for ways that I might
> shorten this processing time. Any suggestions?
> {quote}
> Back then, [~wastl-nagel] suggested turning off the normalizers and filters
> during the inversion step, which speeds up the process considerably.
> In my case, however, I depend on those, so this is not a real solution.
> I opened this issue to get feedback on how we could improve the crawl
> script and speed up this step.
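
For illustration only, the workaround mentioned in the description (running link inversion with URL normalizers and filters turned off) might look roughly like the sketch below. It is a dry run that only prints the command line rather than invoking Nutch. The values of CRAWL_PATH and SEGMENT and the helper build_invertlinks_cmd are hypothetical and are not taken from the issue or from src/bin/crawl; the -noNormalize and -noFilter options are those listed in the usage message of the invertlinks (LinkDb) tool.

```shell
# Dry-run sketch: prints the invertlinks command instead of running it.
# CRAWL_PATH and SEGMENT are hypothetical placeholder values.
CRAWL_PATH="crawl"
SEGMENT="20200101000000"

# build_invertlinks_cmd is a hypothetical helper, not part of Nutch.
# $1 = "fast" skips URL normalizers and filters; anything else keeps them.
build_invertlinks_cmd() {
  cmd="bin/nutch invertlinks $CRAWL_PATH/linkdb $CRAWL_PATH/segments/$SEGMENT"
  if [ "$1" = "fast" ]; then
    cmd="$cmd -noNormalize -noFilter"
  fi
  echo "$cmd"
}

build_invertlinks_cmd fast
build_invertlinks_cmd safe
```

As the reporter notes, the "fast" variant is only an option when the crawl does not rely on normalizing or filtering URLs while the linkdb is updated.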
--
This message was sent by Atlassian Jira
(v8.3.4#803005)