[jira] [Commented] (NUTCH-2496) Speed up link inversion step in crawling script

2020-08-16 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178620#comment-17178620
 ] 

Hudson commented on NUTCH-2496:
---

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #3 (See 
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/3/])
NUTCH-2496 Speed up link inversion step in crawling script (snagel: 
[https://github.com/apache/nutch/commit/ea6b2f08024fe98ffc62269fdb6f6c700b8f177e])
* (edit) src/bin/crawl


> Speed up link inversion step in crawling script
> ---
>
> Key: NUTCH-2496
> URL: https://issues.apache.org/jira/browse/NUTCH-2496
> Project: Nutch
>  Issue Type: Improvement
>  Components: linkdb
>Affects Versions: 1.15
>Reporter: Moreno Feltscher
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.17
>
>
> While working on a project where I have to index a huge number of URLs I 
> encountered an issue with the link inversion step of the crawling script. A 
> while ago Ian Lopata stumbled upon the same issue as described here: 
> http://lucene.472066.n3.nabble.com/InvertLinks-Performance-Nutch-1-6-td4183004.html
> {quote}
> I am running the invertlinks step in my Nutch 1.6 based crawl process on a 
> single node.  I run invertlinks only because I need the Inlinks in the 
> indexer step so as to store them with the document.  I do not need the 
> anchor text and I am not scoring.  I am finding that invertlinks (and more 
> specifically the merge of the linkdb) takes a long time - about 30 minutes 
> for a crawl of around 150K documents.  I am looking for ways that I might 
> shorten this processing time.  Any suggestions? 
> {quote}
> Back then [~wastl-nagel] suggested turning off the normalizers and filters 
> during the inversion step which speeds up the process a bunch.
> In my case however I kind of depend on those so this is no real solution.
> I opened this issue here in order to get some feedback on how we could 
> improve things in a crawl script and speed up the process.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (NUTCH-2496) Speed up link inversion step in crawling script

2020-06-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17129537#comment-17129537
 ] 

ASF GitHub Bot commented on NUTCH-2496:
---

sebastian-nagel merged pull request #527:
URL: https://github.com/apache/nutch/pull/527


   







[jira] [Commented] (NUTCH-2496) Speed up link inversion step in crawling script

2020-06-09 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17129192#comment-17129192
 ] 

Hudson commented on NUTCH-2496:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3684 (See 
[https://builds.apache.org/job/Nutch-trunk/3684/])
NUTCH-2496 Speed up link inversion step in crawling script (snagel: 
[https://github.com/apache/nutch/commit/7fba6df55d0db81a05958983ba704823c2dff07e])
* (edit) src/bin/crawl




[jira] [Commented] (NUTCH-2496) Speed up link inversion step in crawling script

2020-05-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108478#comment-17108478
 ] 

ASF GitHub Bot commented on NUTCH-2496:
---

sebastian-nagel opened a new pull request #527:
URL: https://github.com/apache/nutch/pull/527


   - disable URL filtering and normalizing when calling invertlinks in bin/crawl
   - add a note that the steps invertlinks, dedup, and index could also be done
     outside the loop, over all segments created in the loop iterations
   - move webgraph construction (commented out anyway) outside the loop because
     it's done over all available segments
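
As a rough illustration of the first item, the sketch below assembles an invertlinks call with URL normalizing and filtering switched off. The `-noNormalize` and `-noFilter` options are the ones listed in the usage message of the Nutch 1.x `invertlinks` (LinkDb) tool; the function name, the `NUTCH_BIN` variable, and the crawl paths are assumptions made for this example and are not taken from the patch. The function only prints the command so it can be inspected before anything is run.

```shell
#!/bin/sh
# Hypothetical helper (not part of bin/crawl): prints an invertlinks command
# for one segment with URL normalizing and filtering disabled.
# NUTCH_BIN is an assumed variable pointing at the nutch launcher script.
NUTCH_BIN="${NUTCH_BIN:-bin/nutch}"

invertlinks_cmd() {
  crawl_path="$1"   # assumed crawl directory, e.g. "crawl"
  segment="$2"      # name of the segment directory to invert
  printf '%s invertlinks %s/linkdb %s/segments/%s -noNormalize -noFilter\n' \
    "$NUTCH_BIN" "$crawl_path" "$crawl_path" "$segment"
}

# Example call with a made-up segment name.
invertlinks_cmd crawl 20200515120000
```

Printing rather than executing keeps the sketch safe to try on a machine without a Nutch installation; piping the output to `sh` would be one way to run it for real.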







[jira] [Commented] (NUTCH-2496) Speed up link inversion step in crawling script

2018-01-17 Thread Moreno Feltscher (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329760#comment-16329760
 ] 

Moreno Feltscher commented on NUTCH-2496:
-

Thanks again for clearing things up even more.

One last question about the "changing normalizers and/or filters" though: What 
happens if I change let's say my filters and after that I do a full re-crawl 
(inject - generate - fetch - parse - update - link inversion - index - index 
cleanup) without having filtering turned on in my link inversion step? Would 
Nutch take into account the new filters and eventually drop documents that do 
not match the filters anymore from my index?



[jira] [Commented] (NUTCH-2496) Speed up link inversion step in crawling script

2018-01-16 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326999#comment-16326999
 ] 

Markus Jelsma commented on NUTCH-2496:
--

Yes it makes a lot of sense to disable it everywhere except when running the 
injector and during the parse phase (or fetch phase if you parse during fetch).

 

Parse and inject are the points of entry, that is where new records are added 
to the DB, that is where you want to filter.

 

You only need to filter in another phase when you have changed your 
normalizers and/or filters.



[jira] [Commented] (NUTCH-2496) Speed up link inversion step in crawling script

2018-01-15 Thread Moreno Feltscher (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326640#comment-16326640
 ] 

Moreno Feltscher commented on NUTCH-2496:
-

[~markus17]: Thanks for that hint. This is something I still don't really get. 
Where and to what steps exactly are those filters/normalizers being applied?

In my case I only have a {{regex-urlfilter.txt}} file as well as the following 
plugin configuration:
{code:xml}
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|protocol-http|urlfilter-regex|index-(basic|anchor|metadata)|headings|language-identifier|query-(basic|site|url|lang)|indexer-elastic-rest|parse-(text|html|tika|metatags)|urlnormalizer-(pass|regex|basic)</value>
</property>
{code}

Would it make sense to disable filtering/normalization in LinkDB?



[jira] [Commented] (NUTCH-2496) Speed up link inversion step in crawling script

2018-01-13 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16325053#comment-16325053
 ] 

Markus Jelsma commented on NUTCH-2496:
--

If you use the same filters/normalizers everywhere in Nutch, you don't need to 
enable filtering/normalization in LinkDB. Unless your use case is very specific, 
disable it.



[jira] [Commented] (NUTCH-2496) Speed up link inversion step in crawling script

2018-01-12 Thread Moreno Feltscher (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16324737#comment-16324737
 ] 

Moreno Feltscher commented on NUTCH-2496:
-

One thing I found out is that if I do the link inversion step after all the 
iterations are done, it takes a lot less time. Would it be feasible to move the 
link inversion and indexing steps out of the loop and do them only once at the 
end? Any thoughts about this?
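
A minimal dry-run sketch of that idea, assuming a crawl directory named `crawl` and a fixed number of rounds: the per-round steps are only indicated by a placeholder line, and link inversion is printed once after the loop, using the `-dir` option of `invertlinks` to cover every segment under the segments directory. The function name `plan_crawl` and the variables `NUTCH_BIN` and `CRAWL_PATH` are hypothetical and chosen for this example; they are not copied from the crawl script.

```shell
#!/bin/sh
# Dry-run sketch: prints the planned commands instead of executing them.
# NUTCH_BIN and CRAWL_PATH are assumed names used only for illustration.
NUTCH_BIN="${NUTCH_BIN:-bin/nutch}"
CRAWL_PATH="${CRAWL_PATH:-crawl}"

plan_crawl() {
  rounds="$1"
  i=1
  while [ "$i" -le "$rounds" ]; do
    # The generate, fetch, parse and updatedb steps would run here each round.
    echo "# round $i: generate / fetch / parse / updatedb"
    i=$((i + 1))
  done
  # Invert links once, over all segments created by the rounds above.
  echo "$NUTCH_BIN invertlinks $CRAWL_PATH/linkdb -dir $CRAWL_PATH/segments"
}

plan_crawl 3
```

Whether this ordering is acceptable probably depends on whether documents must be indexed with their inlinks after every round; if they must, the inversion has to stay inside the loop.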
