[jira] [Commented] (NUTCH-1741) Support of Sitemaps in Nutch 2.x

2018-03-22 Thread Ben Vachon (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16410136#comment-16410136
 ] 

Ben Vachon commented on NUTCH-1741:
---

Thank you, [~wastl-nagel] and [~yossi]. That makes much more sense now.

I suppose my situation is unusual. Custom protocol, parser, and signature 
plugins in our framework ignore the "http.content.limit" property and stream 
buffered content directly to/from a separately managed content store.

Looking at NutchSitemapParser, I realize that's not going to work with the 
Nutch sitemaps architecture, so I'd like to also change NutchSitemapParser into 
a plugin point (perhaps handled by the ParserFactory) if that makes sense.

> Support of Sitemaps in Nutch 2.x
> 
>
> Key: NUTCH-1741
> URL: https://issues.apache.org/jira/browse/NUTCH-1741
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher, generator
>Reporter: Alparslan Avcı
>Assignee: Cihad Guzel
>Priority: Major
>  Labels: gsoc2015
> Fix For: 2.4
>
> Attachments: NUTCH-1741-v2.patch, NUTCH-1741-v3.patch, 
> NUTCH-1741-v4.patch, NUTCH-1741-webpage-avsc.patch, NUTCH-1741.patch, 
> NUTCH-1741v5.patch, NUTCH-1741v6.patch, NUTCH-1741v7.patch, 
> SitemapCrawlerLifeCycle.pdf, SitemapDevelopmentFor2x.pdf
>
>
> Sitemap support has to be implemented for 2.x branch. It is being discussed 
> in NUTCH-1465 for trunk. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (NUTCH-1741) Support of Sitemaps in Nutch 2.x

2018-03-22 Thread Ben Vachon (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16409607#comment-16409607
 ] 

Ben Vachon edited comment on NUTCH-1741 at 3/22/18 3:02 PM:


Is there a reason that sitemap page handling has to be separate from 
non-sitemap page handling?
 The way this is working in 2.x right now is that you generate a batch of only 
sitemaps, then fetch and parse that batch of only sitemaps. Then generate a 
batch of only non-sitemaps and fetch and parse that batch.
 What prevents us from allowing Nutch to generate a mixed batch and fetch and 
parse a mixed batch but just handle pages with the InjectType.SITEMAP_INJECT 
differently when parsing?

edit:

I read that the idea was to give the user more control, but for my use-case 
it's really just added overhead. Would anyone be opposed to me adding an 
alternate option when running the GeneratorJob/FetcherJob/ParserJob to create 
and handle mixed batches (basically just making the "-sitemaps" argument flag 
into a three-way option)?
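
To make the proposal concrete, here is a rough sketch of what a three-way batch selection could look like. It is not taken from the Nutch code base: the BatchMode enum, the Page class and the select method are all hypothetical names, and the real jobs would read the sitemap marker from the stored WebPage rather than from a boolean field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical three-way batch selection; none of these names exist in Nutch. */
final class SitemapBatchModeSketch {

  /** The three values a non-boolean "-sitemaps" option could take. */
  enum BatchMode {
    SITEMAPS_ONLY,
    NON_SITEMAPS_ONLY,
    MIXED;

    /** Returns true if a page of the given kind belongs in a batch of this mode. */
    boolean accepts(boolean isSitemap) {
      switch (this) {
        case SITEMAPS_ONLY:
          return isSitemap;
        case NON_SITEMAPS_ONLY:
          return !isSitemap;
        default:
          return true; // MIXED takes both kinds
      }
    }
  }

  /** Stand-in for a stored page; only the sitemap marker matters here. */
  static final class Page {
    final String url;
    final boolean isSitemap;

    Page(String url, boolean isSitemap) {
      this.url = url;
      this.isSitemap = isSitemap;
    }
  }

  /** Collects the URLs that the given mode would put into one batch. */
  static List<String> select(List<Page> candidates, BatchMode mode) {
    List<String> batch = new ArrayList<>();
    for (Page page : candidates) {
      if (mode.accepts(page.isSitemap)) {
        batch.add(page.url);
      }
    }
    return batch;
  }

  public static void main(String[] args) {
    List<Page> pages = Arrays.asList(
        new Page("http://example.com/sitemap.xml", true),
        new Page("http://example.com/index.html", false));
    for (BatchMode mode : BatchMode.values()) {
      System.out.println(mode + " -> " + select(pages, mode));
    }
  }
}
```

The two existing behaviours map onto SITEMAPS_ONLY and NON_SITEMAPS_ONLY, so current users would see no change; only MIXED is new.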


was (Author: bvachon):
Is there a reason that sitemap page handling has to be separate from 
non-sitemap page handling?
 The way this is working in 2.x right now is that you generate a batch of only 
sitemaps, then fetch and parse that batch of only sitemaps. Then generate a 
batch of only non-sitemaps and fetch and parse that batch.
 What prevents us from allowing Nutch to generate a mixed batch and fetch and 
parse a mixed batch but just handle pages with the InjectType.SITEMAP_INJECT 
differently when parsing?

edit:

I read that the idea was to give the user more control. Would anyone be opposed 
to me adding an alternate option when running the 
GeneratorJob/FetcherJob/ParserJob to create and handle mixed batches (basically 
just making the "-sitemaps" argument flag into a three-way option)?

> Support of Sitemaps in Nutch 2.x
> 
>
> Key: NUTCH-1741
> URL: https://issues.apache.org/jira/browse/NUTCH-1741
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher, generator
>Reporter: Alparslan Avcı
>Assignee: Cihad Guzel
>Priority: Major
>  Labels: gsoc2015
> Fix For: 2.4
>
> Attachments: NUTCH-1741-v2.patch, NUTCH-1741-v3.patch, 
> NUTCH-1741-v4.patch, NUTCH-1741-webpage-avsc.patch, NUTCH-1741.patch, 
> NUTCH-1741v5.patch, NUTCH-1741v6.patch, NUTCH-1741v7.patch, 
> SitemapCrawlerLifeCycle.pdf, SitemapDevelopmentFor2x.pdf
>
>
> Sitemap support has to be implemented for 2.x branch. It is being discussed 
> in NUTCH-1465 for trunk. 





[jira] [Comment Edited] (NUTCH-1741) Support of Sitemaps in Nutch 2.x

2018-03-22 Thread Ben Vachon (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16409607#comment-16409607
 ] 

Ben Vachon edited comment on NUTCH-1741 at 3/22/18 2:55 PM:


Is there a reason that sitemap page handling has to be separate from 
non-sitemap page handling?
 The way this is working in 2.x right now is that you generate a batch of only 
sitemaps, then fetch and parse that batch of only sitemaps. Then generate a 
batch of only non-sitemaps and fetch and parse that batch.
 What prevents us from allowing Nutch to generate a mixed batch and fetch and 
parse a mixed batch but just handle pages with the InjectType.SITEMAP_INJECT 
differently when parsing?

edit:

I read that the idea was to give the user more control. Would anyone be opposed 
to me adding an alternate option when running the 
GeneratorJob/FetcherJob/ParserJob to create and handle mixed batches (basically 
just making the "-sitemaps" argument flag into a three-way option)?


was (Author: bvachon):
Is there a reason that sitemap page handling has to be separate from 
non-sitemap page handling?
 The way this is working in 2.x right now is that you generate a batch of only 
sitemaps, then fetch and parse that batch of only sitemaps. Then generate a 
batch of only non-sitemaps and fetch and parse that batch.
 What prevents us from allowing Nutch to generate a mixed batch and fetch and 
parse a mixed batch but just handle pages with the InjectType.SITEMAP_INJECT 
differently when parsing?

edit:

I read that the idea was to give the user more control. Would anyone be opposed 
to me adding an alternate option when running the 
GeneratorJob/FetcherJob/ParserJob to create and handle mixed batches?

> Support of Sitemaps in Nutch 2.x
> 
>
> Key: NUTCH-1741
> URL: https://issues.apache.org/jira/browse/NUTCH-1741
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher, generator
>Reporter: Alparslan Avcı
>Assignee: Cihad Guzel
>Priority: Major
>  Labels: gsoc2015
> Fix For: 2.4
>
> Attachments: NUTCH-1741-v2.patch, NUTCH-1741-v3.patch, 
> NUTCH-1741-v4.patch, NUTCH-1741-webpage-avsc.patch, NUTCH-1741.patch, 
> NUTCH-1741v5.patch, NUTCH-1741v6.patch, NUTCH-1741v7.patch, 
> SitemapCrawlerLifeCycle.pdf, SitemapDevelopmentFor2x.pdf
>
>
> Sitemap support has to be implemented for 2.x branch. It is being discussed 
> in NUTCH-1465 for trunk. 





[jira] [Comment Edited] (NUTCH-1741) Support of Sitemaps in Nutch 2.x

2018-03-22 Thread Ben Vachon (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16409607#comment-16409607
 ] 

Ben Vachon edited comment on NUTCH-1741 at 3/22/18 2:53 PM:


Is there a reason that sitemap page handling has to be separate from 
non-sitemap page handling?
 The way this is working in 2.x right now is that you generate a batch of only 
sitemaps, then fetch and parse that batch of only sitemaps. Then generate a 
batch of only non-sitemaps and fetch and parse that batch.
 What prevents us from allowing Nutch to generate a mixed batch and fetch and 
parse a mixed batch but just handle pages with the InjectType.SITEMAP_INJECT 
differently when parsing?

edit:

I read that the idea was to give the user more control. Would anyone be opposed 
to me adding an alternate option when running the 
GeneratorJob/FetcherJob/ParserJob to create and handle mixed batches?


was (Author: bvachon):
Is there a reason that sitemap page handling has to be separate from 
non-sitemap page handling?
 The way this is working in 2.x right now is that you generate a list of only 
sitemaps, then fetch and parse that list of only sitemaps. Then generate a list 
of only non-sitemaps and fetch and parse that list.
 What prevents us from allowing Nutch to generate a mixed list and fetch and 
parse a mixed list but just handle pages with the InjectType.SITEMAP_INJECT 
differently when parsing?

edit:

I read that the idea was to give the user more control. Would anyone be opposed 
to me adding an alternate option when running the 
GeneratorJob/FetcherJob/ParserJob to create and handle mixed lists?

> Support of Sitemaps in Nutch 2.x
> 
>
> Key: NUTCH-1741
> URL: https://issues.apache.org/jira/browse/NUTCH-1741
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher, generator
>Reporter: Alparslan Avcı
>Assignee: Cihad Guzel
>Priority: Major
>  Labels: gsoc2015
> Fix For: 2.4
>
> Attachments: NUTCH-1741-v2.patch, NUTCH-1741-v3.patch, 
> NUTCH-1741-v4.patch, NUTCH-1741-webpage-avsc.patch, NUTCH-1741.patch, 
> NUTCH-1741v5.patch, NUTCH-1741v6.patch, NUTCH-1741v7.patch, 
> SitemapCrawlerLifeCycle.pdf, SitemapDevelopmentFor2x.pdf
>
>
> Sitemap support has to be implemented for 2.x branch. It is being discussed 
> in NUTCH-1465 for trunk. 





[jira] [Comment Edited] (NUTCH-1741) Support of Sitemaps in Nutch 2.x

2018-03-22 Thread Ben Vachon (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16409607#comment-16409607
 ] 

Ben Vachon edited comment on NUTCH-1741 at 3/22/18 2:52 PM:


Is there a reason that sitemap page handling has to be separate from 
non-sitemap page handling?
 The way this is working in 2.x right now is that you generate a list of only 
sitemaps, then fetch and parse that list of only sitemaps. Then generate a list 
of only non-sitemaps and fetch and parse that list.
 What prevents us from allowing Nutch to generate a mixed list and fetch and 
parse a mixed list but just handle pages with the InjectType.SITEMAP_INJECT 
differently when parsing?

edit:

I read that the idea was to give the user more control. Would anyone be opposed 
to me adding an alternate option when running the 
GeneratorJob/FetcherJob/ParserJob to create and handle mixed lists?


was (Author: bvachon):
Is there a reason that sitemap page handling has to be separate from 
non-sitemap page handling?
The way this is working in 2.x right now is that you generate a list of only 
sitemaps, then fetch and parse that list of only sitemaps. Then generate a list 
of only non-sitemaps and fetch and parse that list.
What prevents us from allowing Nutch to generate a mixed list and fetch and 
parse a mixed list but just handle pages with the InjectType.SITEMAP_INJECT 
differently when parsing?

> Support of Sitemaps in Nutch 2.x
> 
>
> Key: NUTCH-1741
> URL: https://issues.apache.org/jira/browse/NUTCH-1741
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher, generator
>Reporter: Alparslan Avcı
>Assignee: Cihad Guzel
>Priority: Major
>  Labels: gsoc2015
> Fix For: 2.4
>
> Attachments: NUTCH-1741-v2.patch, NUTCH-1741-v3.patch, 
> NUTCH-1741-v4.patch, NUTCH-1741-webpage-avsc.patch, NUTCH-1741.patch, 
> NUTCH-1741v5.patch, NUTCH-1741v6.patch, NUTCH-1741v7.patch, 
> SitemapCrawlerLifeCycle.pdf, SitemapDevelopmentFor2x.pdf
>
>
> Sitemap support has to be implemented for 2.x branch. It is being discussed 
> in NUTCH-1465 for trunk. 





[jira] [Commented] (NUTCH-2536) GeneratorReducer.count is a static variable

2018-03-22 Thread Ben Vachon (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16409614#comment-16409614
 ] 

Ben Vachon commented on NUTCH-2536:
---

I don't know how this affects fully distributed Nutch systems, and I only just 
saw NUTCH-2328.

 

> GeneratorReducer.count is a static variable
> ---
>
> Key: NUTCH-2536
> URL: https://issues.apache.org/jira/browse/NUTCH-2536
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 2.3.1
> Environment: Non-distributed, single node, standalone Nutch jobs run 
> in a single JVM with HBase as the data store. 2.3.1
>Reporter: Ben Vachon
>Priority: Minor
>  Labels: Generate
> Fix For: 2.4
>
>   Original Estimate: 2.4h
>  Remaining Estimate: 2.4h
>
> The count field of the GeneratorReducer class is a static field. This means 
> that if the GeneratorJob is run multiple times within the same JVM, it will 
> count all the webpages generated across all batches.
> The count field is checked against the GeneratorJob's topN configuration 
> variable, which is described as:
> "top threshold for maximum number of URLs permitted in a batch"
> I understand this to mean that EACH batch should be capped at the topN value, 
> not ALL batches.
> This isn't a problem with the way that Nutch is typically used because the 
> script starts a new JVM each time. I'm not using the script, I'm calling the 
> java classes directly (using the ToolRunner) within an existing JVM, so I'm 
> categorizing this as an SDK issue.
> Changing the field to be non-static will not affect the behavior of the class 
> as it's run by the script.





[jira] [Commented] (NUTCH-1741) Support of Sitemaps in Nutch 2.x

2018-03-22 Thread Ben Vachon (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16409607#comment-16409607
 ] 

Ben Vachon commented on NUTCH-1741:
---

Is there a reason that sitemap page handling has to be separate from 
non-sitemap page handling?
The way this is working in 2.x right now is that you generate a list of only 
sitemaps, then fetch and parse that list of only sitemaps. Then generate a list 
of only non-sitemaps and fetch and parse that list.
What prevents us from allowing Nutch to generate a mixed list and fetch and 
parse a mixed list but just handle pages with the InjectType.SITEMAP_INJECT 
differently when parsing?

> Support of Sitemaps in Nutch 2.x
> 
>
> Key: NUTCH-1741
> URL: https://issues.apache.org/jira/browse/NUTCH-1741
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher, generator
>Reporter: Alparslan Avcı
>Assignee: Cihad Guzel
>Priority: Major
>  Labels: gsoc2015
> Fix For: 2.4
>
> Attachments: NUTCH-1741-v2.patch, NUTCH-1741-v3.patch, 
> NUTCH-1741-v4.patch, NUTCH-1741-webpage-avsc.patch, NUTCH-1741.patch, 
> NUTCH-1741v5.patch, NUTCH-1741v6.patch, NUTCH-1741v7.patch, 
> SitemapCrawlerLifeCycle.pdf, SitemapDevelopmentFor2x.pdf
>
>
> Sitemap support has to be implemented for 2.x branch. It is being discussed 
> in NUTCH-1465 for trunk. 





[jira] [Updated] (NUTCH-2536) GeneratorReducer.count is a static variable

2018-03-21 Thread Ben Vachon (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Vachon updated NUTCH-2536:
--
Environment: Non-distributed, single node, standalone Nutch jobs run in a 
single JVM with HBase as the data store. 2.3.1

> GeneratorReducer.count is a static variable
> ---
>
> Key: NUTCH-2536
> URL: https://issues.apache.org/jira/browse/NUTCH-2536
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 2.3.1
> Environment: Non-distributed, single node, standalone Nutch jobs run 
> in a single JVM with HBase as the data store. 2.3.1
>Reporter: Ben Vachon
>Priority: Minor
>  Labels: Generate
> Fix For: 2.4
>
>   Original Estimate: 2.4h
>  Remaining Estimate: 2.4h
>
> The count field of the GeneratorReducer class is a static field. This means 
> that if the GeneratorJob is run multiple times within the same JVM, it will 
> count all the webpages generated across all batches.
> The count field is checked against the GeneratorJob's topN configuration 
> variable, which is described as:
> "top threshold for maximum number of URLs permitted in a batch"
> I understand this to mean that EACH batch should be capped at the topN value, 
> not ALL batches.
> This isn't a problem with the way that Nutch is typically used because the 
> script starts a new JVM each time. I'm not using the script, I'm calling the 
> java classes directly (using the ToolRunner) within an existing JVM, so I'm 
> categorizing this as an SDK issue.
> Changing the field to be non-static will not affect the behavior of the class 
> as it's run by the script.





[jira] [Updated] (NUTCH-2540) Support Generic Deduplication in Nutch 2.x

2018-03-21 Thread Ben Vachon (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Vachon updated NUTCH-2540:
--
Environment: (was: Non-distributed, single node, standalone Nutch jobs 
run in a single JVM with HBase as the data store. 2.3.1)

> Support Generic Deduplication in Nutch 2.x
> --
>
> Key: NUTCH-2540
> URL: https://issues.apache.org/jira/browse/NUTCH-2540
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Affects Versions: 2.3.1
>Reporter: Ben Vachon
>Priority: Major
>  Labels: dedupe
> Fix For: 2.4
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> Currently, deduplication in 2.x exists only as a utility for the Solr index.
> My use-case for Nutch required deduplication so I wrote custom code that 
> checks for duplicates based on digest and deletes them at index time. I 
> figured I'd port the change so that others could use it as well.
> This is a very simple approach to Deduplication. There's plenty of room to 
> improve it.
> This change adds a new DataStore for Duplicate entries that are just lists of 
> urls with signatures as keys.
> A DeduplicatorJob can be run between the DbUpdaterJob and IndexingJob to map 
> WebPages into the Duplicate DataStore.
> Since the key of the Duplicate store is the digest field of the WebPage store 
> entries, duplicate matching can be configured via extension of the Signature 
> abstract class.
> A new "-deduplicate" argument is added to the IndexingJob (false by default). 
> If this flag is used, then the IndexingJob will check the Duplicate DataStore 
> for duplicate URLs, run pluggable DuplicateFilters to determine which URL 
> belongs to the original WebPage, and skip the WebPage if it is not the 
> original, and delete (from the index) the other pages if the WebPage is the 
> original.
> I've also added a BasicDuplicateFilter plugin class that considers the URL 
> with the shortest path to be the original.
> Eventually, it would be best to consider things like score and fetch time 
> when determining which WebPage to keep and which to remove.





[jira] [Updated] (NUTCH-2540) Support Generic Deduplication in Nutch 2.x

2018-03-21 Thread Ben Vachon (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Vachon updated NUTCH-2540:
--
Environment: Non-distributed, single node, standalone Nutch jobs run in a 
single JVM with HBase as the data store. 2.3.1

> Support Generic Deduplication in Nutch 2.x
> --
>
> Key: NUTCH-2540
> URL: https://issues.apache.org/jira/browse/NUTCH-2540
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Affects Versions: 2.3.1
> Environment: Non-distributed, single node, standalone Nutch jobs run 
> in a single JVM with HBase as the data store. 2.3.1
>Reporter: Ben Vachon
>Priority: Major
>  Labels: dedupe
> Fix For: 2.4
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> Currently, deduplication in 2.x exists only as a utility for the Solr index.
> My use-case for Nutch required deduplication so I wrote custom code that 
> checks for duplicates based on digest and deletes them at index time. I 
> figured I'd port the change so that others could use it as well.
> This is a very simple approach to Deduplication. There's plenty of room to 
> improve it.
> This change adds a new DataStore for Duplicate entries that are just lists of 
> urls with signatures as keys.
> A DeduplicatorJob can be run between the DbUpdaterJob and IndexingJob to map 
> WebPages into the Duplicate DataStore.
> Since the key of the Duplicate store is the digest field of the WebPage store 
> entries, duplicate matching can be configured via extension of the Signature 
> abstract class.
> A new "-deduplicate" argument is added to the IndexingJob (false by default). 
> If this flag is used, then the IndexingJob will check the Duplicate DataStore 
> for duplicate URLs, run pluggable DuplicateFilters to determine which URL 
> belongs to the original WebPage, and skip the WebPage if it is not the 
> original, and delete (from the index) the other pages if the WebPage is the 
> original.
> I've also added a BasicDuplicateFilter plugin class that considers the URL 
> with the shortest path to be the original.
> Eventually, it would be best to consider things like score and fetch time 
> when determining which WebPage to keep and which to remove.





[jira] [Created] (NUTCH-2540) Support Generic Deduplication in Nutch 2.x

2018-03-20 Thread Ben Vachon (JIRA)
Ben Vachon created NUTCH-2540:
-

 Summary: Support Generic Deduplication in Nutch 2.x
 Key: NUTCH-2540
 URL: https://issues.apache.org/jira/browse/NUTCH-2540
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Affects Versions: 2.3.1
Reporter: Ben Vachon
 Fix For: 2.4


Currently, deduplication in 2.x exists only as a utility for the Solr index.

My use-case for Nutch required deduplication so I wrote custom code that checks 
for duplicates based on digest and deletes them at index time. I figured I'd 
port the change so that others could use it as well.

This is a very simple approach to Deduplication. There's plenty of room to 
improve it.

This change adds a new DataStore for Duplicate entries that are just lists of 
urls with signatures as keys.

A DeduplicatorJob can be run between the DbUpdaterJob and IndexingJob to map 
WebPages into the Duplicate DataStore.

Since the key of the Duplicate store is the digest field of the WebPage store 
entries, duplicate matching can be configured via extension of the Signature 
abstract class.

A new "-deduplicate" argument is added to the IndexingJob (false by default). 
If this flag is used, then the IndexingJob will check the Duplicate DataStore 
for duplicate URLs, run pluggable DuplicateFilters to determine which URL 
belongs to the original WebPage, and skip the WebPage if it is not the 
original, and delete (from the index) the other pages if the WebPage is the 
original.

I've also added a BasicDuplicateFilter plugin class that considers the URL with 
the shortest path to be the original.

Eventually, it would be best to consider things like score and fetch time when 
determining which WebPage to keep and which to remove.
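
As an illustration of the "shortest path wins" rule and the index-time decision described above, here is a minimal sketch. It is not the code from the patch: the class and method names are hypothetical, the duplicate group is modelled as a plain list of URLs that share one signature, and ties are resolved by keeping the first URL in the list, which is an assumption of this sketch.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a duplicate filter; not the patch's actual class. */
final class ShortestPathDuplicateSketch {

  /** Length of the path component of the URL, or 0 if there is none. */
  static int pathLength(String url) {
    String path = URI.create(url).getPath();
    return path == null ? 0 : path.length();
  }

  /** Picks the URL with the shortest path; the first one wins on a tie. */
  static String pickOriginal(List<String> duplicateUrls) {
    if (duplicateUrls.isEmpty()) {
      throw new IllegalArgumentException("duplicate group must not be empty");
    }
    String best = duplicateUrls.get(0);
    for (String url : duplicateUrls) {
      if (pathLength(url) < pathLength(best)) {
        best = url;
      }
    }
    return best;
  }

  /** Index-time decision: only the original of a duplicate group is indexed. */
  static boolean shouldIndex(String url, List<String> duplicateUrls) {
    return url.equals(pickOriginal(duplicateUrls));
  }

  public static void main(String[] args) {
    // All three URLs are assumed to share one signature (digest).
    List<String> group = Arrays.asList(
        "http://example.com/a/b/page.html",
        "http://example.com/page.html",
        "http://example.com/a/page.html");
    System.out.println("original: " + pickOriginal(group));
    for (String url : group) {
      System.out.println(url + " indexed? " + shouldIndex(url, group));
    }
  }
}
```

Taking score or fetch time into account, as suggested above, would only change how pickOriginal compares two candidates.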





[jira] [Commented] (NUTCH-2536) GeneratorReducer.count is a static variable

2018-03-16 Thread Ben Vachon (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402022#comment-16402022
 ] 

Ben Vachon commented on NUTCH-2536:
---

pull request: https://github.com/apache/nutch/pull/298

> GeneratorReducer.count is a static variable
> ---
>
> Key: NUTCH-2536
> URL: https://issues.apache.org/jira/browse/NUTCH-2536
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 2.3.1
>Reporter: Ben Vachon
>Priority: Minor
>  Labels: Generate
> Fix For: 2.4
>
>   Original Estimate: 2.4h
>  Remaining Estimate: 2.4h
>
> The count field of the GeneratorReducer class is a static field. This means 
> that if the GeneratorJob is run multiple times within the same JVM, it will 
> count all the webpages generated across all batches.
> The count field is checked against the GeneratorJob's topN configuration 
> variable, which is described as:
> "top threshold for maximum number of URLs permitted in a batch"
> I understand this to mean that EACH batch should be capped at the topN value, 
> not ALL batches.
> This isn't a problem with the way that Nutch is typically used because the 
> script starts a new JVM each time. I'm not using the script, I'm calling the 
> java classes directly (using the ToolRunner) within an existing JVM, so I'm 
> categorizing this as an SDK issue.
> Changing the field to be non-static will not affect the behavior of the class 
> as it's run by the script.





[jira] [Issue Comment Deleted] (NUTCH-2536) GeneratorReducer.count is a static variable

2018-03-16 Thread Ben Vachon (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Vachon updated NUTCH-2536:
--
Comment: was deleted

(was: pull request: https://github.com/apache/nutch/pull/298)

> GeneratorReducer.count is a static variable
> ---
>
> Key: NUTCH-2536
> URL: https://issues.apache.org/jira/browse/NUTCH-2536
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 2.3.1
>Reporter: Ben Vachon
>Priority: Minor
>  Labels: Generate
> Fix For: 2.4
>
>   Original Estimate: 2.4h
>  Remaining Estimate: 2.4h
>
> The count field of the GeneratorReducer class is a static field. This means 
> that if the GeneratorJob is run multiple times within the same JVM, it will 
> count all the webpages generated across all batches.
> The count field is checked against the GeneratorJob's topN configuration 
> variable, which is described as:
> "top threshold for maximum number of URLs permitted in a batch"
> I understand this to mean that EACH batch should be capped at the topN value, 
> not ALL batches.
> This isn't a problem with the way that Nutch is typically used because the 
> script starts a new JVM each time. I'm not using the script, I'm calling the 
> java classes directly (using the ToolRunner) within an existing JVM, so I'm 
> categorizing this as an SDK issue.
> Changing the field to be non-static will not affect the behavior of the class 
> as it's run by the script.





[jira] [Created] (NUTCH-2536) GeneratorReducer.count is a static variable

2018-03-16 Thread Ben Vachon (JIRA)
Ben Vachon created NUTCH-2536:
-

 Summary: GeneratorReducer.count is a static variable
 Key: NUTCH-2536
 URL: https://issues.apache.org/jira/browse/NUTCH-2536
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 2.3.1
Reporter: Ben Vachon
 Fix For: 2.4


The count field of the GeneratorReducer class is a static field. This means 
that if the GeneratorJob is run multiple times within the same JVM, it will 
count all the webpages generated across all batches.

The count field is checked against the GeneratorJob's topN configuration 
variable, which is described as:

"top threshold for maximum number of URLs permitted in a batch"

I understand this to mean that EACH batch should be capped at the topN value, 
not ALL batches.

This isn't a problem with the way that Nutch is typically used because the 
script starts a new JVM each time. I'm not using the script, I'm calling the 
java classes directly (using the ToolRunner) within an existing JVM, so I'm 
categorizing this as an SDK issue.

Changing the field to be non-static will not affect the behavior of the class 
as it's run by the script.
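
The effect can be reproduced without Nutch. The sketch below is not the GeneratorReducer source; StaticCountReducer and InstanceCountReducer are hypothetical stand-ins that only model the topN check, so that the difference between a static and a per-instance counter is visible when two batches run in the same JVM.

```java
/** Hypothetical model of the topN check; not the actual GeneratorReducer code. */
final class TopNCounterSketch {

  /** Mimics the reported behaviour: one counter shared by every run in the JVM. */
  static final class StaticCountReducer {
    static long count = 0; // survives across reducer instances
    private final long topN;

    StaticCountReducer(long topN) {
      this.topN = topN;
    }

    /** Emits a page unless the shared counter has already reached topN. */
    boolean tryEmit() {
      if (count >= topN) {
        return false;
      }
      count++;
      return true;
    }
  }

  /** Mimics the proposed fix: each reducer instance counts its own batch. */
  static final class InstanceCountReducer {
    private long count = 0;
    private final long topN;

    InstanceCountReducer(long topN) {
      this.topN = topN;
    }

    boolean tryEmit() {
      if (count >= topN) {
        return false;
      }
      count++;
      return true;
    }
  }

  /** Runs one batch with a fresh static-count reducer; returns pages emitted. */
  static int runStaticBatch(int candidates, long topN) {
    StaticCountReducer reducer = new StaticCountReducer(topN);
    int emitted = 0;
    for (int i = 0; i < candidates; i++) {
      if (reducer.tryEmit()) {
        emitted++;
      }
    }
    return emitted;
  }

  /** Runs one batch with a fresh instance-count reducer; returns pages emitted. */
  static int runInstanceBatch(int candidates, long topN) {
    InstanceCountReducer reducer = new InstanceCountReducer(topN);
    int emitted = 0;
    for (int i = 0; i < candidates; i++) {
      if (reducer.tryEmit()) {
        emitted++;
      }
    }
    return emitted;
  }

  public static void main(String[] args) {
    System.out.println("static, batch 1:   " + runStaticBatch(5, 3));
    System.out.println("static, batch 2:   " + runStaticBatch(5, 3));
    System.out.println("instance, batch 1: " + runInstanceBatch(5, 3));
    System.out.println("instance, batch 2: " + runInstanceBatch(5, 3));
  }
}
```

With five candidates and topN set to 3, the first static batch emits three pages and the second emits none, because the shared counter is already at the limit. The instance-count version emits three pages in each batch. When every batch starts in a new JVM, as with the script, the two versions behave the same.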





[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2018-03-09 Thread Ben Vachon (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16393274#comment-16393274
 ] 

Ben Vachon commented on NUTCH-1465:
---

Is there any plan to pull this to 2.x?

> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Markus Jelsma
>Priority: Major
> Fix For: 1.14
>
> Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, 
> NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, 
> NUTCH-1465-trunk.v5.patch, NUTCH-1465.patch, NUTCH-1465.patch, 
> NUTCH-1465.patch, NUTCH-1465.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html





[jira] [Issue Comment Deleted] (NUTCH-2292) Mavenize the build for nutch-core and nutch-plugins

2018-03-09 Thread Ben Vachon (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Vachon updated NUTCH-2292:
--
Comment: was deleted

(was: Yokay, I'll make a pull request from the NUTCH-2292 branch once all the 
changes are in)

> Mavenize the build for nutch-core and nutch-plugins
> ---
>
> Key: NUTCH-2292
> URL: https://issues.apache.org/jira/browse/NUTCH-2292
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Reporter: Thamme Gowda
>Assignee: Thamme Gowda
>Priority: Major
> Fix For: 1.15
>
>
> Convert the build system of  nutch-core as well as plugins to Apache Maven.
> *Plan :*
> Create multi-module maven project with the following structure
> {code}
> nutch-parent
>   |-- pom.xml (POM)
>   |-- nutch-core
>   |   |-- pom.xml (JAR)
>   |   |--src: sources
>   |-- nutch-plugins
>       |-- pom.xml (POM)
>       |-- plugin1
>       |    |-- pom.xml (JAR)
>       | .
>       |-- pluginN
>            |-- pom.xml (JAR)
> {code}
> NOTE: watch out for cyclic dependencies between nutch-core and plugins, 
> introduce another POM to break the cycle if required.
>  





[jira] [Comment Edited] (NUTCH-2292) Mavenize the build for nutch-core and nutch-plugins

2017-04-07 Thread Ben Vachon (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961148#comment-15961148
 ] 

Ben Vachon edited comment on NUTCH-2292 at 4/7/17 5:29 PM:
---

Yokay, I'll make a pull request from the NUTCH-2292 branch once all the changes 
are in


was (Author: bvachon):
Yokay, I'll make a pull request once the changes are in for 1.14

> Mavenize the build for nutch-core and nutch-plugins
> ---
>
> Key: NUTCH-2292
> URL: https://issues.apache.org/jira/browse/NUTCH-2292
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Reporter: Thamme Gowda
>Assignee: Thamme Gowda
> Fix For: 1.14
>
>
> Convert the build system of  nutch-core as well as plugins to Apache Maven.
> *Plan :*
> Create multi-module maven project with the following structure
> {code}
> nutch-parent
>   |-- pom.xml (POM)
>   |-- nutch-core
>   |   |-- pom.xml (JAR)
>   |   |--src: sources
>   |-- nutch-plugins
>       |-- pom.xml (POM)
>       |-- plugin1
>       |    |-- pom.xml (JAR)
>       | .
>       |-- pluginN
>            |-- pom.xml (JAR)
> {code}
> NOTE: watch out for cyclic dependencies between nutch-core and plugins, 
> introduce another POM to break the cycle if required.
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (NUTCH-2292) Mavenize the build for nutch-core and nutch-plugins

2017-04-07 Thread Ben Vachon (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961148#comment-15961148
 ] 

Ben Vachon commented on NUTCH-2292:
---

Yokay, I'll make a pull request once the changes are in for 1.14

> Mavenize the build for nutch-core and nutch-plugins
> ---
>
> Key: NUTCH-2292
> URL: https://issues.apache.org/jira/browse/NUTCH-2292
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Reporter: Thamme Gowda
>Assignee: Thamme Gowda
> Fix For: 1.14
>
>
> Convert the build system of  nutch-core as well as plugins to Apache Maven.
> *Plan :*
> Create multi-module maven project with the following structure
> {code}
> nutch-parent
>   |-- pom.xml (POM)
>   |-- nutch-core
>   |   |-- pom.xml (JAR)
>   |   |--src: sources
>   |-- nutch-plugins
>       |-- pom.xml (POM)
>       |-- plugin1
>       |    |-- pom.xml (JAR)
>       | .
>       |-- pluginN
>            |-- pom.xml (JAR)
> {code}
> NOTE: watch out for cyclic dependencies between nutch-core and plugins, 
> introduce another POM to break the cycle if required.
>  





[jira] [Commented] (NUTCH-2292) Mavenize the build for nutch-core and nutch-plugins

2017-04-07 Thread Ben Vachon (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960908#comment-15960908
 ] 

Ben Vachon commented on NUTCH-2292:
---

Can this work be done in the 2.x branch as well?
Will the plugins be published as maven artifacts when it's done so that I can 
just add them to my pom alongside Nutch? Or will they come with the Nutch 
artifact?
Thank you.

> Mavenize the build for nutch-core and nutch-plugins
> ---
>
> Key: NUTCH-2292
> URL: https://issues.apache.org/jira/browse/NUTCH-2292
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Reporter: Thamme Gowda
>Assignee: Thamme Gowda
> Fix For: 1.14
>
>
> Convert the build system of  nutch-core as well as plugins to Apache Maven.
> *Plan :*
> Create multi-module maven project with the following structure
> {code}
> nutch-parent
>   |-- pom.xml (POM)
>   |-- nutch-core
>   |   |-- pom.xml (JAR)
>   |   |--src: sources
>   |-- nutch-plugins
>       |-- pom.xml (POM)
>       |-- plugin1
>       |    |-- pom.xml (JAR)
>       | .
>       |-- pluginN
>            |-- pom.xml (JAR)
> {code}
> NOTE: watch out for cyclic dependencies between nutch-core and plugins, 
> introduce another POM to break the cycle if required.
>  


