[jira] [Commented] (NUTCH-1228) Change mapred.task.timeout to mapreduce.task.timeout in fetcher

2013-07-17 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711616#comment-13711616
 ] 

Markus Jelsma commented on NUTCH-1228:
--

Yes, but this is dependent on upgrading to the new Hadoop MapReduce API versus the 
current mapred API. I made some modifications but stumbled on some major 
problems. We should not fix this issue until we're sure the new MapReduce API 
has all the features we need. I remember issues with the MapFile APIs, among 
others. I think I left some comments about that here and on the (unanswered) 
Hadoop list.

> Change mapred.task.timeout to mapreduce.task.timeout in fetcher
> ---
>
> Key: NUTCH-1228
> URL: https://issues.apache.org/jira/browse/NUTCH-1228
> Project: Nutch
>  Issue Type: Task
>  Components: fetcher
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Trivial
> Fix For: 1.9
>
> Attachments: NUTCH-1228-2.1.patch
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1612) Getting URl Malformed exception with Nutch 2.2 and Hadoop 1.0.3

2013-07-17 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711612#comment-13711612
 ] 

Markus Jelsma commented on NUTCH-1612:
--

We tried 1.0.2 and had a miserable time there. 1.2.0 fixed all major issues 
with Hadoop.

> Getting URl Malformed exception with Nutch 2.2 and Hadoop 1.0.3
> ---
>
> Key: NUTCH-1612
> URL: https://issues.apache.org/jira/browse/NUTCH-1612
> Project: Nutch
>  Issue Type: Bug
> Environment: Ubuntu 64 bit, nutch 2.2, hadoop 1.0.3,hbase-0.90.3
>Reporter: Amit Yadav
>
> When I start crawling using bin/crawl I get a "URLMalfomed Exception". 
> I am using HBase as the data store. I can see that the WebTable is created in 
> HBase.
> I am able to run the same in local mode.
> Any help on this would be appreciated.



[jira] [Commented] (NUTCH-1457) Nutch2 Refactor the update process so that fetched items are only processed once

2013-07-17 Thread Ferdy Galema (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711529#comment-13711529
 ] 

Ferdy Galema commented on NUTCH-1457:
-

Ok cool. Like Lewis said it would be best to create patches that we can apply 
to the trunk codebase, so that there can be no misconceptions when committing 
the changes.

Thanks.

> Nutch2 Refactor the update process so that fetched items are only processed 
> once
> 
>
> Key: NUTCH-1457
> URL: https://issues.apache.org/jira/browse/NUTCH-1457
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
> Fix For: 2.4
>
> Attachments: CrawlStatus.java, DbUpdateReducer.java, 
> GeneratorMapper.java, GeneratorReducer.java
>
>




[jira] [Comment Edited] (NUTCH-1457) Nutch2 Refactor the update process so that fetched items are only processed once

2013-07-17 Thread Riyaz Shaik (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711510#comment-13711510
 ] 

Riyaz Shaik edited comment on NUTCH-1457 at 7/17/13 7:34 PM:
-

Hi Ferdy,

The scenario mentioned below will not occur:

 *although there might be a problem with code that assumes STATUS_FETCHED, for 
example the ParserJob: It only processes STATUS_FETCHED entries. There may be 
more dependencies.*

We do not set the *??GENERATE_MARK??* for URLs whose *??fetchtime > currentTime??* 
in GeneratorReducer, so those URLs will not be processed in the Fetcher/Parser 
jobs.

One drawback of this solution (UNSCHEDULED status/mark in GeneratorMapper) is 
that we update a few columns for all URLs (SCHEDULED + UNSCHEDULED) in HBase 
from ??GeneratorReducer??, which might reduce ??GeneratorReducer?? performance.

We have made the changes you suggested (a SCHEDULED marker instead of the 
UNSCHEDULED status/marker) and added the SCHEDULED marker in *??GeneratorReducer??*. 
It works fine and also overcomes the drawback of our earlier solution.

Will attach the code changes.

Thanks Ferdy.. :)

  was (Author: riyaz):
Hi Ferdy,

The below mentioned scenario will not occur:

 *although there might be a problem with code that assumes STATUS_FETCHED, for 
example the ParserJob: It only processes STATUS_FETCHED entries. There may be 
more dependencies.*

We do not set the *??GENERATE_MARK??* for URLs whose *??fetchtime > currentTime??* 
in GeneratorReducer, so those URLs will not be processed in the Fetcher/Parser 
jobs.

One drawback of this solution is that we update a few columns for all URLs 
(SCHEDULED + UNSCHEDULED) in HBase from ??GeneratorReducer??, which might reduce 
??GeneratorReducer?? performance.

We have made the changes you suggested (a SCHEDULED marker instead of the 
UNSCHEDULED status/marker) and added the SCHEDULED marker in *??GeneratorReducer??*. 
It works fine and also overcomes the drawback of our earlier solution.

Will attach the code changes.

Thanks Ferdy.. :)
  
> Nutch2 Refactor the update process so that fetched items are only processed 
> once
> 
>
> Key: NUTCH-1457
> URL: https://issues.apache.org/jira/browse/NUTCH-1457
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
> Fix For: 2.4
>
> Attachments: CrawlStatus.java, DbUpdateReducer.java, 
> GeneratorMapper.java, GeneratorReducer.java
>
>




[jira] [Commented] (NUTCH-1457) Nutch2 Refactor the update process so that fetched items are only processed once

2013-07-17 Thread Riyaz Shaik (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711510#comment-13711510
 ] 

Riyaz Shaik commented on NUTCH-1457:


Hi Ferdy,

The scenario mentioned below will not occur:

 *although there might be a problem with code that assumes STATUS_FETCHED, for 
example the ParserJob: It only processes STATUS_FETCHED entries. There may be 
more dependencies.*

We do not set the *??GENERATE_MARK??* for URLs whose *??fetchtime > currentTime??* 
in GeneratorReducer, so those URLs will not be processed in the Fetcher/Parser 
jobs.

One drawback of this solution is that we update a few columns for all URLs 
(SCHEDULED + UNSCHEDULED) in HBase from ??GeneratorReducer??, which might reduce 
??GeneratorReducer?? performance.

We have made the changes you suggested (a SCHEDULED marker instead of the 
UNSCHEDULED status/marker) and added the SCHEDULED marker in *??GeneratorReducer??*. 
It works fine and also overcomes the drawback of our earlier solution.

Will attach the code changes.

Thanks Ferdy.. :)
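
The due-time check described above can be sketched as follows. This is a 
conceptual illustration only, not the actual GeneratorReducer code; the Page 
class is a hypothetical stand-in for Nutch's WebPage:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Conceptual sketch: only URLs whose fetch time is due receive the
// generate mark; unmarked URLs are skipped by the fetch/parse jobs.
public class GenerateMarkSketch {
    static class Page {                  // hypothetical stand-in for WebPage
        final String url;
        final long fetchTime;
        Page(String url, long fetchTime) { this.url = url; this.fetchTime = fetchTime; }
    }

    static List<String> markForGenerate(List<Page> pages, long currentTime) {
        List<String> marked = new ArrayList<>();
        for (Page p : pages) {
            if (p.fetchTime <= currentTime) {
                marked.add(p.url);       // the real code would set GENERATE_MARK here
            }
            // pages with fetchTime > currentTime get no mark and are skipped
        }
        return marked;
    }

    public static void main(String[] args) {
        List<Page> pages = Arrays.asList(
            new Page("http://a.example/", 100L),
            new Page("http://b.example/", 900L));
        // only the URL whose fetch time is already due is marked
        System.out.println(markForGenerate(pages, 500L));
    }
}
```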

> Nutch2 Refactor the update process so that fetched items are only processed 
> once
> 
>
> Key: NUTCH-1457
> URL: https://issues.apache.org/jira/browse/NUTCH-1457
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
> Fix For: 2.4
>
> Attachments: CrawlStatus.java, DbUpdateReducer.java, 
> GeneratorMapper.java, GeneratorReducer.java
>
>




[jira] [Commented] (NUTCH-1228) Change mapred.task.timeout to mapreduce.task.timeout in fetcher

2013-07-17 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711496#comment-13711496
 ] 

Lewis John McGibbney commented on NUTCH-1228:
-

Does this affect 2.x? There is no issue description.

> Change mapred.task.timeout to mapreduce.task.timeout in fetcher
> ---
>
> Key: NUTCH-1228
> URL: https://issues.apache.org/jira/browse/NUTCH-1228
> Project: Nutch
>  Issue Type: Task
>  Components: fetcher
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Trivial
> Fix For: 1.9
>
> Attachments: NUTCH-1228-2.1.patch
>
>




[jira] [Closed] (NUTCH-1612) Getting URl Malformed exception with Nutch 2.2 and Hadoop 1.0.3

2013-07-17 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-1612.
---

Resolution: Cannot Reproduce

Amit, please go to the user list for queries such as this. If, over there, we 
find that this is a bug, we will then come to the Jira tracker and file an 
issue.
Please describe your setup and include as much useful content from your logs as 
possible.
Thank you.

> Getting URl Malformed exception with Nutch 2.2 and Hadoop 1.0.3
> ---
>
> Key: NUTCH-1612
> URL: https://issues.apache.org/jira/browse/NUTCH-1612
> Project: Nutch
>  Issue Type: Bug
> Environment: Ubuntu 64 bit, nutch 2.2, hadoop 1.0.3,hbase-0.90.3
>Reporter: Amit Yadav
>
> When I start crawling using bin/crawl I get a "URLMalfomed Exception". 
> I am using HBase as the data store. I can see that the WebTable is created in 
> HBase.
> I am able to run the same in local mode.
> Any help on this would be appreciated.



[jira] [Updated] (NUTCH-1613) Timeouts in protocol-httpclient when crawling same host with >2 threads and added cookie strings for both http protocols

2013-07-17 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1613:


Fix Version/s: 2.3

> Timeouts in protocol-httpclient when crawling same host with >2 threads and 
> added cookie strings for both http protocols
> 
>
> Key: NUTCH-1613
> URL: https://issues.apache.org/jira/browse/NUTCH-1613
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Affects Versions: 2.2.1
>Reporter: Brian
>Priority: Minor
>  Labels: patch
> Fix For: 2.3
>
> Attachments: NUTCH-1613.patch
>
>
> 1.)  When using protocol-httpclient to crawl a single website (the same host) 
> I would always get a bunch of timeout errors during fetching and the pages 
> with errors would not be fetched. E.g.:
> 2013-07-09 17:57:13,717 WARN  fetcher.FetcherJob - fetch of http://www 
> failed with: org.apache.commons.httpclient.ConnectionPoolTimeoutException: 
> Timeout waiting for connection
> 2013-07-09 17:57:13,718 INFO  fetcher.FetcherJob - fetching http://www 
> (queue crawl delay=0ms)
> 2013-07-09 17:57:13,715 ERROR httpclient.Http - Failed with the following 
> error: 
> org.apache.commons.httpclient.ConnectionPoolTimeoutException: Timeout waiting 
> for connection
>   at 
> org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:497)
>   at 
> org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(MultiThreadedHttpConnectionManager.java:416)
>   at 
> org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:153)
>   at 
> org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
>   at 
> org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
>   at 
> org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:95)
>   at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:174)
>   at 
> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:133)
>   at 
> org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:518)
> This is because by default the connection pool manager only allows 2 
> connections per host so if more than 2 threads are used the others will tend 
> to time out waiting to get a connection.   The code previously set max 
> connections correctly but not connection per host.
> 2.) I also added at the same time simple modifications to both protocol-http 
> and protocol-httpclient to allow specifying a cookie string in the conf file 
> to include in request headers.  
> I use this to crawl site content requiring authentication - it is better for 
> me to specify the cookie string for the authentication than go through the 
> whole authentication process and specifying login info.
> The nutch-site.xml property is the following:
> <property>
>   <name>http.cookie_string</name>
>   <value>XX_AL=authorization_value_goes_here</value>
>   <description>String to use as the cookie value for HTTP requests</description>
> </property>
> Although I use it for authentication it can be used to specify any single 
> cookie string for the crawl (httpclient does support different cookies for 
> different hosts but I did not get into that).
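
The failure mode in item 1 can be sketched independently of Nutch. The 
following is a minimal, self-contained illustration (not Nutch or 
commons-httpclient code): with a pool capped at two connections per host, any 
additional thread times out waiting for a permit, analogous to the 
ConnectionPoolTimeoutException above.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Conceptual sketch: a per-host "pool" of 2 permits, mirroring the
// default per-host connection limit described in the issue.
public class PerHostPoolSketch {
    public static void main(String[] args) throws InterruptedException {
        Semaphore perHost = new Semaphore(2); // two connections per host

        perHost.acquire(); // fetcher thread 1 holds a connection
        perHost.acquire(); // fetcher thread 2 holds a connection

        // Thread 3 waits up to 100 ms for a connection and gives up,
        // like the timeouts reported during fetching.
        boolean got = perHost.tryAcquire(100, TimeUnit.MILLISECONDS);
        System.out.println(got ? "connection acquired"
                               : "timeout waiting for connection");
    }
}
```

Raising the per-host limit to match the fetcher thread count, as the patch 
does, removes this contention.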



[jira] [Updated] (NUTCH-1300) Indexer to filter and normalize URL's

2013-07-17 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1300:
-

Summary: Indexer to filter and normalize URL's  (was: Indexer to normalize 
URL's)

> Indexer to filter and normalize URL's
> -
>
> Key: NUTCH-1300
> URL: https://issues.apache.org/jira/browse/NUTCH-1300
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.6
>
> Attachments: NUTCH-1300-1.5-1.patch
>
>
> Indexers should be able to normalize URL's. This is useful when a new 
> normalizer is applied to the entire CrawlDB. Without it, some or all records 
> in a segment cannot be indexed at all.



[jira] [Commented] (NUTCH-1614) Plugin to exclude URLs matching regex list from indexing - to enable crawl but do not index

2013-07-17 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711435#comment-13711435
 ] 

Markus Jelsma commented on NUTCH-1614:
--

Most if not all filters/normalizers support setting a config file, so you can 
use a different filter/normalizer config file per stage. This way you can use 
one set of regex rules during fetch/update and another during indexing. You'll 
have to check each plugin's code to find the exact configuration parameter that 
points it to a different config file.
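
As an illustration, a per-stage override in nutch-site.xml could look like the 
following. The property name shown is the one used by the regex URL filter 
plugin; the file name is hypothetical, and other plugins use their own property 
names, so check each plugin's source:

```xml
<!-- Illustrative only: urlfilter.regex.file is the regex URL filter's
     config-file property; the value here is a hypothetical file name. -->
<property>
  <name>urlfilter.regex.file</name>
  <value>regex-urlfilter-indexing.txt</value>
</property>
```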

I think this was never ported to 2.x, so it's probably better to first port the 
pluggable indexing backends from 1.x to 2.x and then let them also support 
filtering and normalizing.

Also, NUTCH-1300's title is wrong; it should be normalizing AND filtering. If 
you check the patch you'll see it's actually about both.

> Plugin to exclude URLs matching regex list from indexing - to enable crawl 
> but do not index
> ---
>
> Key: NUTCH-1614
> URL: https://issues.apache.org/jira/browse/NUTCH-1614
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 2.2.1
>Reporter: Brian
>Priority: Minor
>  Labels: plugin
> Attachments: NUTCH-1614.patch
>
>
> Some pages we need to crawl (such as some main pages and different views of a 
> main page) to get all the other pages, but we don't want to index those pages 
> themselves. Therefore we cannot use the URL filter approach.
> This plugin uses a file containing regex strings (see included sample file). 
> If one of the regex strings matches an entire URL, that URL will be 
> excluded from indexing.
> The file to use is specified by the following property in nutch-site.xml:
> <property>
>   <name>indexer.url.filter.exclude.regex.file</name>
>   <value>regex-indexer-exclude-urls.txt</value>
>   <description>
>     Holds the file name containing the regex strings. Any URL
>     matching one of these strings will be excluded from indexing.
>     "#" indicates a comment line and will be ignored.
>   </description>
> </property>



[jira] [Reopened] (NUTCH-1300) Indexer to normalize URL's

2013-07-17 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma reopened NUTCH-1300:
--


> Indexer to normalize URL's
> --
>
> Key: NUTCH-1300
> URL: https://issues.apache.org/jira/browse/NUTCH-1300
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.6
>
> Attachments: NUTCH-1300-1.5-1.patch
>
>
> Indexers should be able to normalize URL's. This is useful when a new 
> normalizer is applied to the entire CrawlDB. Without it, some or all records 
> in a segment cannot be indexed at all.



[jira] [Resolved] (NUTCH-1300) Indexer to filter and normalize URL's

2013-07-17 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-1300.
--

Resolution: Fixed

Renamed the issue for clarity.

> Indexer to filter and normalize URL's
> -
>
> Key: NUTCH-1300
> URL: https://issues.apache.org/jira/browse/NUTCH-1300
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.6
>
> Attachments: NUTCH-1300-1.5-1.patch
>
>
> Indexers should be able to normalize URL's. This is useful when a new 
> normalizer is applied to the entire CrawlDB. Without it, some or all records 
> in a segment cannot be indexed at all.



[jira] [Comment Edited] (NUTCH-1614) Plugin to exclude URLs matching regex list from indexing - to enable crawl but do not index

2013-07-17 Thread Brian (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711365#comment-13711365
 ] 

Brian edited comment on NUTCH-1614 at 7/17/13 6:29 PM:
---

Can you please tell me how to do this? I couldn't find anything about how to 
do this. From what I can tell, URL filters apply to crawling, not just indexing, 
and I couldn't see how to apply them to indexing only. I don't see how 
normalizing a URL would help in this case if it still filters the URL from the 
crawl and not just from indexing.

I see an option with the solrindex command, but it appears to be available only 
in Nutch 1.x. Even if it were in 2.x, it is not clear from the documentation how 
to use the option to achieve the desired effect:
http://wiki.apache.org/nutch/bin/nutch%20solrindex


  was (Author: brian44):
Can you please tell me how to do this? I couldn't find anything about how 
to do this. From what I can tell, URL filters apply to crawling, not just 
indexing, and I couldn't see how to apply them to indexing only. I don't see how 
normalizing a URL would help in this case if it still filters the URL from the 
crawl and not just from indexing.
  
> Plugin to exclude URLs matching regex list from indexing - to enable crawl 
> but do not index
> ---
>
> Key: NUTCH-1614
> URL: https://issues.apache.org/jira/browse/NUTCH-1614
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 2.2.1
>Reporter: Brian
>Priority: Minor
>  Labels: plugin
> Attachments: NUTCH-1614.patch
>
>
> Some pages we need to crawl (such as some main pages and different views of a 
> main page) to get all the other pages, but we don't want to index those pages 
> themselves. Therefore we cannot use the URL filter approach.
> This plugin uses a file containing regex strings (see included sample file). 
> If one of the regex strings matches an entire URL, that URL will be 
> excluded from indexing.
> The file to use is specified by the following property in nutch-site.xml:
> <property>
>   <name>indexer.url.filter.exclude.regex.file</name>
>   <value>regex-indexer-exclude-urls.txt</value>
>   <description>
>     Holds the file name containing the regex strings. Any URL
>     matching one of these strings will be excluded from indexing.
>     "#" indicates a comment line and will be ignored.
>   </description>
> </property>



[jira] [Commented] (NUTCH-1614) Plugin to exclude URLs matching regex list from indexing - to enable crawl but do not index

2013-07-17 Thread Brian (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711365#comment-13711365
 ] 

Brian commented on NUTCH-1614:
--

Can you please tell me how to do this? I couldn't find anything about how to 
do this. From what I can tell, URL filters apply to crawling, not just indexing, 
and I couldn't see how to apply them to indexing only. I don't see how 
normalizing a URL would help in this case if it still filters the URL from the 
crawl and not just from indexing.

> Plugin to exclude URLs matching regex list from indexing - to enable crawl 
> but do not index
> ---
>
> Key: NUTCH-1614
> URL: https://issues.apache.org/jira/browse/NUTCH-1614
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 2.2.1
>Reporter: Brian
>Priority: Minor
>  Labels: plugin
> Attachments: NUTCH-1614.patch
>
>
> Some pages we need to crawl (such as some main pages and different views of a 
> main page) to get all the other pages, but we don't want to index those pages 
> themselves. Therefore we cannot use the URL filter approach.
> This plugin uses a file containing regex strings (see included sample file). 
> If one of the regex strings matches an entire URL, that URL will be 
> excluded from indexing.
> The file to use is specified by the following property in nutch-site.xml:
> <property>
>   <name>indexer.url.filter.exclude.regex.file</name>
>   <value>regex-indexer-exclude-urls.txt</value>
>   <description>
>     Holds the file name containing the regex strings. Any URL
>     matching one of these strings will be excluded from indexing.
>     "#" indicates a comment line and will be ignored.
>   </description>
> </property>



[jira] [Commented] (NUTCH-1614) Plugin to exclude URLs matching regex list from indexing - to enable crawl but do not index

2013-07-17 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711308#comment-13711308
 ] 

Markus Jelsma commented on NUTCH-1614:
--

You have been able to do this since Nutch 1.5. It doesn't need any special 
plugin; it just reuses both the filtering and normalizing systems in Nutch.

> Plugin to exclude URLs matching regex list from indexing - to enable crawl 
> but do not index
> ---
>
> Key: NUTCH-1614
> URL: https://issues.apache.org/jira/browse/NUTCH-1614
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 2.2.1
>Reporter: Brian
>Priority: Minor
>  Labels: plugin
> Attachments: NUTCH-1614.patch
>
>
> Some pages we need to crawl (such as some main pages and different views of a 
> main page) to get all the other pages, but we don't want to index those pages 
> themselves. Therefore we cannot use the URL filter approach.
> This plugin uses a file containing regex strings (see included sample file). 
> If one of the regex strings matches an entire URL, that URL will be 
> excluded from indexing.
> The file to use is specified by the following property in nutch-site.xml:
> <property>
>   <name>indexer.url.filter.exclude.regex.file</name>
>   <value>regex-indexer-exclude-urls.txt</value>
>   <description>
>     Holds the file name containing the regex strings. Any URL
>     matching one of these strings will be excluded from indexing.
>     "#" indicates a comment line and will be ignored.
>   </description>
> </property>



[jira] [Updated] (NUTCH-1614) Plugin to exclude URLs matching regex list from indexing - to enable crawl but do not index

2013-07-17 Thread Brian (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian updated NUTCH-1614:
-

Attachment: NUTCH-1614.patch

> Plugin to exclude URLs matching regex list from indexing - to enable crawl 
> but do not index
> ---
>
> Key: NUTCH-1614
> URL: https://issues.apache.org/jira/browse/NUTCH-1614
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 2.2.1
>Reporter: Brian
>Priority: Minor
>  Labels: plugin
> Attachments: NUTCH-1614.patch
>
>
> Some pages we need to crawl (such as some main pages and different views of a 
> main page) to get all the other pages, but we don't want to index those pages 
> themselves. Therefore we cannot use the URL filter approach.
> This plugin uses a file containing regex strings (see included sample file). 
> If one of the regex strings matches an entire URL, that URL will be 
> excluded from indexing.
> The file to use is specified by the following property in nutch-site.xml:
> <property>
>   <name>indexer.url.filter.exclude.regex.file</name>
>   <value>regex-indexer-exclude-urls.txt</value>
>   <description>
>     Holds the file name containing the regex strings. Any URL
>     matching one of these strings will be excluded from indexing.
>     "#" indicates a comment line and will be ignored.
>   </description>
> </property>



[jira] [Created] (NUTCH-1614) Plugin to exclude URLs matching regex list from indexing - to enable crawl but do not index

2013-07-17 Thread Brian (JIRA)
Brian created NUTCH-1614:


 Summary: Plugin to exclude URLs matching regex list from indexing 
- to enable crawl but do not index
 Key: NUTCH-1614
 URL: https://issues.apache.org/jira/browse/NUTCH-1614
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 2.2.1
Reporter: Brian
Priority: Minor


Some pages we need to crawl (such as some main pages and different views of a 
main page) to get all the other pages, but we don't want to index those pages 
themselves. Therefore we cannot use the URL filter approach.

This plugin uses a file containing regex strings (see included sample file). 
If one of the regex strings matches an entire URL, that URL will be 
excluded from indexing.

The file to use is specified by the following property in nutch-site.xml:

<property>
  <name>indexer.url.filter.exclude.regex.file</name>
  <value>regex-indexer-exclude-urls.txt</value>
  <description>
    Holds the file name containing the regex strings. Any URL matching
    one of these strings will be excluded from indexing.
    "#" indicates a comment line and will be ignored.
  </description>
</property>

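The matching behaviour described above can be sketched as follows. This is a 
simplified, self-contained illustration, not the attached plugin code; the 
regex rules are passed in as a list here instead of being read from the 
configured file:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Simplified sketch of the exclusion check: a URL is excluded from
// indexing if any regex matches the entire URL; lines starting with
// "#" are treated as comments and ignored.
public class IndexExcludeSketch {
    static boolean excluded(String url, List<String> ruleLines) {
        for (String line : ruleLines) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) continue; // comment line
            if (Pattern.matches(line, url)) return true;          // whole-URL match
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> rules = Arrays.asList(
            "# exclude listing views",
            "https?://example\\.com/list\\?.*");
        // the listing view matches a rule and is excluded; the article is not
        System.out.println(excluded("http://example.com/list?page=2", rules));
        System.out.println(excluded("http://example.com/article/42", rules));
    }
}
```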



[jira] [Commented] (NUTCH-1613) Timeouts in protocol-httpclient when crawling same host with >2 threads and added cookie strings for both http protocols

2013-07-17 Thread Brian (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711223#comment-13711223
 ] 

Brian commented on NUTCH-1613:
--

Yes, if set it will be included in requests for all URLs.  

> Timeouts in protocol-httpclient when crawling same host with >2 threads and 
> added cookie strings for both http protocols
> 
>
> Key: NUTCH-1613
> URL: https://issues.apache.org/jira/browse/NUTCH-1613
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Affects Versions: 2.2.1
>Reporter: Brian
>Priority: Minor
>  Labels: patch
> Attachments: NUTCH-1613.patch
>
>
> 1.)  When using protocol-httpclient to crawl a single website (the same host) 
> I would always get a bunch of timeout errors during fetching and the pages 
> with errors would not be fetched. E.g.:
> 2013-07-09 17:57:13,717 WARN  fetcher.FetcherJob - fetch of http://www 
> failed with: org.apache.commons.httpclient.ConnectionPoolTimeoutException: 
> Timeout waiting for connection
> 2013-07-09 17:57:13,718 INFO  fetcher.FetcherJob - fetching http://www 
> (queue crawl delay=0ms)
> 2013-07-09 17:57:13,715 ERROR httpclient.Http - Failed with the following 
> error: 
> org.apache.commons.httpclient.ConnectionPoolTimeoutException: Timeout waiting 
> for connection
>   at 
> org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:497)
>   at 
> org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(MultiThreadedHttpConnectionManager.java:416)
>   at 
> org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:153)
>   at 
> org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
>   at 
> org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
>   at 
> org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:95)
>   at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:174)
>   at 
> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:133)
>   at 
> org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:518)
> This is because by default the connection pool manager only allows 2 
> connections per host so if more than 2 threads are used the others will tend 
> to time out waiting to get a connection.   The code previously set max 
> connections correctly but not connection per host.
> 2.) At the same time I also added simple modifications to both protocol-http
> and protocol-httpclient to allow specifying a cookie string in the conf file
> to include in request headers.
> I use this to crawl site content requiring authentication: it is easier for
> me to specify the cookie string for the authentication than to go through
> the whole authentication process and specify login info.
> The nutch-site.xml property is the following:
> <property>
>   <name>http.cookie_string</name>
>   <value>XX_AL=authorization_value_goes_here</value>
>   <description>String to use as the cookie value for HTTP
>   requests</description>
> </property>
> Although I use it for authentication, it can be used to specify any single
> cookie string for the crawl (httpclient does support different cookies for
> different hosts, but I did not get into that).
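The per-host limit behavior described above can be modeled with a plain-Java toy, where a Semaphore stands in for httpclient's per-host connection slots (in commons-httpclient 3.x the actual fix is a call like HttpConnectionManagerParams.setDefaultMaxConnectionsPerHost, which the patch adds; the demo below is only an illustration, not the patch):

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class PerHostPoolDemo {
    // Toy model of the per-host connection limit: each host has a fixed
    // number of slots, and a fetcher thread must claim one before it can
    // open a connection. With the default of 2 slots and more than 2
    // threads, the extra threads wait and eventually hit the timeout --
    // the ConnectionPoolTimeoutException shown in the log above.
    static boolean tryConnect(Semaphore hostSlots, long timeoutMs) {
        try {
            return hostSlots.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        Semaphore twoSlots = new Semaphore(2);   // httpclient's default limit
        twoSlots.tryAcquire(2);                  // two threads already fetching
        System.out.println(tryConnect(twoSlots, 10));  // third thread: false

        Semaphore tenSlots = new Semaphore(10);  // raised per-host limit
        tenSlots.tryAcquire(2);                  // two threads fetching
        System.out.println(tryConnect(tenSlots, 10));  // third thread: true
    }
}
```

Raising the per-host limit to at least the fetcher thread count removes the artificial wait, which is exactly what the attached patch does through the httpclient parameters.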

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1613) Timeouts in protocol-httpclient when crawling same host with >2 threads and added cookie strings for both http protocols

2013-07-17 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711150#comment-13711150
 ] 

lufeng commented on NUTCH-1613:
---

Will this specified cookie string affect all crawled URLs?
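The NUTCH-1613 report describes http.cookie_string as a single cookie value for the whole crawl, so it would be sent with every request regardless of host. A stdlib-only sketch of that behavior (the helper name is hypothetical, not Nutch's actual code):

```java
import java.util.HashMap;
import java.util.Map;

public class CookieStringSketch {
    // Hypothetical helper: build the headers for an outgoing request.
    // If http.cookie_string is configured, it becomes the Cookie header
    // of every request -- one value for all hosts in the crawl.
    static Map<String, String> requestHeaders(String cookieString) {
        Map<String, String> headers = new HashMap<>();
        if (cookieString != null && !cookieString.isEmpty()) {
            headers.put("Cookie", cookieString);
        }
        return headers;
    }

    public static void main(String[] args) {
        System.out.println(requestHeaders("XX_AL=abc").get("Cookie"));   // XX_AL=abc
        System.out.println(requestHeaders(null).containsKey("Cookie"));  // false
    }
}
```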

> Timeouts in protocol-httpclient when crawling same host with >2 threads and 
> added cookie strings for both http protocols
> 
>
> Key: NUTCH-1613
> URL: https://issues.apache.org/jira/browse/NUTCH-1613
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Affects Versions: 2.2.1
>Reporter: Brian
>Priority: Minor
>  Labels: patch
> Attachments: NUTCH-1613.patch
>
>



[jira] [Updated] (NUTCH-1228) Change mapred.task.timeout to mapreduce.task.timeout in fetcher

2013-07-17 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gül Ahmet Türkoğlu updated NUTCH-1228:
--

Attachment: NUTCH-1228-2.1.patch

I changed mapred.task.timeout to mapreduce.task.timeout in the fetcher.
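The rename itself is the whole change; if backward compatibility with old configs were a concern, a fallback read could look like the sketch below (plain Java with a Map standing in for Hadoop's Configuration; illustrative only, not the attached patch):

```java
import java.util.Map;

public class TaskTimeoutSketch {
    // Read the new Hadoop property name first, falling back to the
    // deprecated mapred.task.timeout so older nutch-site.xml files
    // keep working.
    static long taskTimeoutMillis(Map<String, String> conf, long defaultMs) {
        String v = conf.get("mapreduce.task.timeout");
        if (v == null) {
            v = conf.get("mapred.task.timeout"); // deprecated old name
        }
        return v == null ? defaultMs : Long.parseLong(v);
    }

    public static void main(String[] args) {
        // New name wins; old name still honored; default used otherwise.
        System.out.println(taskTimeoutMillis(
            Map.of("mapreduce.task.timeout", "600000"), 300000L)); // 600000
        System.out.println(taskTimeoutMillis(
            Map.of("mapred.task.timeout", "600000"), 300000L));    // 600000
        System.out.println(taskTimeoutMillis(Map.of(), 300000L));  // 300000
    }
}
```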

> Change mapred.task.timeout to mapreduce.task.timeout in fetcher
> ---
>
> Key: NUTCH-1228
> URL: https://issues.apache.org/jira/browse/NUTCH-1228
> Project: Nutch
>  Issue Type: Task
>  Components: fetcher
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Trivial
> Fix For: 1.9
>
> Attachments: NUTCH-1228-2.1.patch
>
>




[jira] [Updated] (NUTCH-1228) Change mapred.task.timeout to mapreduce.task.timeout in fetcher

2013-07-17 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gül Ahmet Türkoğlu updated NUTCH-1228:
--

Attachment: NUTCH-1228-2.1.patch

> Change mapred.task.timeout to mapreduce.task.timeout in fetcher
> ---
>
> Key: NUTCH-1228
> URL: https://issues.apache.org/jira/browse/NUTCH-1228
> Project: Nutch
>  Issue Type: Task
>  Components: fetcher
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Trivial
> Fix For: 1.9
>
>




[jira] [Updated] (NUTCH-1228) Change mapred.task.timeout to mapreduce.task.timeout in fetcher

2013-07-17 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gül Ahmet Türkoğlu updated NUTCH-1228:
--

Attachment: (was: NUTCH-1228-2.1.patch)

> Change mapred.task.timeout to mapreduce.task.timeout in fetcher
> ---
>
> Key: NUTCH-1228
> URL: https://issues.apache.org/jira/browse/NUTCH-1228
> Project: Nutch
>  Issue Type: Task
>  Components: fetcher
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Trivial
> Fix For: 1.9
>
>




[jira] [Updated] (NUTCH-1228) Change mapred.task.timeout to mapreduce.task.timeout in fetcher

2013-07-17 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gül Ahmet Türkoğlu updated NUTCH-1228:
--

Attachment: NUTCH-1228-2.1.patch

> Change mapred.task.timeout to mapreduce.task.timeout in fetcher
> ---
>
> Key: NUTCH-1228
> URL: https://issues.apache.org/jira/browse/NUTCH-1228
> Project: Nutch
>  Issue Type: Task
>  Components: fetcher
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Trivial
> Fix For: 1.9
>
>

