[jira] [Created] (NUTCH-1015) can't parse erroneous date: 2006-05-24T20:03:42

2011-06-27 Thread Markus Jelsma (JIRA)
can't parse erroneous date: 2006-05-24T20:03:42
---

 Key: NUTCH-1015
 URL: https://issues.apache.org/jira/browse/NUTCH-1015
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Reporter: Markus Jelsma
 Fix For: 1.4, 2.0




--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (NUTCH-1015) MoreIndexingFilter: can't parse erroneous date: 2006-05-24T20:03:42

2011-06-27 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1015:
-

Fix Version/s: (was: 1.4)
   (was: 2.0)

 MoreIndexingFilter: can't parse erroneous date: 2006-05-24T20:03:42
 ---

 Key: NUTCH-1015
 URL: https://issues.apache.org/jira/browse/NUTCH-1015
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Reporter: Markus Jelsma

 MoreIndexingFilter must handle the following dates gracefully:
 {code}
 can't parse erroneous date: Sun, 27 Jun 2010 06:51:35 GMT+1
 can't parse erroneous date: ma, 27 jun 2011 05:15:32 GMT
 can't parse erroneous date: Mon, 23 May 2011 22:05:58 GMT
 can't parse erroneous date: GMT
 {code}
 What to do? Default to now? Fetch time? Anything? 
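
One possible answer to the question above, as a rough Java sketch: try a short list of known patterns and fall back to a caller-supplied timestamp (for instance the fetch time) when none of them match. The class name LenientDateParser, the method parseOrFallback, and the pattern list are hypothetical and are not taken from MoreIndexingFilter or any attached patch.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper, not part of Nutch: parse a date header or fall back.
final class LenientDateParser {

    // Assumed pattern list; a real filter would likely need more entries.
    private static final String[] PATTERNS = {
        "EEE, dd MMM yyyy HH:mm:ss zzz",
        "dd MMM yyyy HH:mm:ss zzz",
        "yyyy-MM-dd'T'HH:mm:ss"
    };

    // Returns the parsed time in milliseconds, or fallbackMillis when the
    // value is null or matches none of the patterns completely.
    static long parseOrFallback(String raw, long fallbackMillis) {
        if (raw == null) {
            return fallbackMillis;
        }
        String value = raw.trim();
        for (String pattern : PATTERNS) {
            // SimpleDateFormat is not thread-safe, so create one per call.
            SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ENGLISH);
            format.setTimeZone(TimeZone.getTimeZone("GMT"));
            format.setLenient(false);
            ParsePosition position = new ParsePosition(0);
            Date parsed = format.parse(value, position);
            // Accept only if the whole string was consumed.
            if (parsed != null && position.getIndex() == value.length()) {
                return parsed.getTime();
            }
        }
        return fallbackMillis;
    }
}
```

Whether the fallback should be "now", the fetch time, or no value at all is still the open question; the sketch only makes the choice explicit at the call site.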





[jira] [Updated] (NUTCH-1015) MoreIndexingFilter: can't parse erroneous date: 2006-05-24T20:03:42

2011-06-27 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1015:
-

Description: 
MoreIndexingFilter must handle the following dates gracefully:

{code}
can't parse erroneous date: Sun, 27 Jun 2010 06:51:35 GMT+1
can't parse erroneous date: ma, 27 jun 2011 05:15:32 GMT
can't parse erroneous date: Mon, 23 May 2011 22:05:58 GMT
can't parse erroneous date: GMT
{code}

  was:
MoreIndexingFilter must handle the following dates gracefully:

{code}
can't parse erroneous date: Sun, 27 Jun 2010 06:51:35 GMT+1
can't parse erroneous date: ma, 27 jun 2011 05:15:32 GMT
can't parse erroneous date: Mon, 23 May 2011 22:05:58 GMT
can't parse erroneous date: GMT
{code}

What to do? Default to now? Fetch time? Anything? 


 MoreIndexingFilter: can't parse erroneous date: 2006-05-24T20:03:42
 ---

 Key: NUTCH-1015
 URL: https://issues.apache.org/jira/browse/NUTCH-1015
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Reporter: Markus Jelsma

 MoreIndexingFilter must handle the following dates gracefully:
 {code}
 can't parse erroneous date: Sun, 27 Jun 2010 06:51:35 GMT+1
 can't parse erroneous date: ma, 27 jun 2011 05:15:32 GMT
 can't parse erroneous date: Mon, 23 May 2011 22:05:58 GMT
 can't parse erroneous date: GMT
 {code}





[jira] [Commented] (NUTCH-1012) Cannot handle illegal charset $charset

2011-06-27 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055479#comment-13055479
 ] 

Markus Jelsma commented on NUTCH-1012:
--

Objections? I'd like to send this one in.

 Cannot handle illegal charset $charset
 --

 Key: NUTCH-1012
 URL: https://issues.apache.org/jira/browse/NUTCH-1012
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.3
Reporter: Markus Jelsma
Priority: Minor
 Fix For: 1.4, 2.0

 Attachments: NUTCH-1012-1.4.patch


 Pages returning:
 {code}
 Content-Type: text/html; charset=$charset
 {code}
 cause:
 {code}
 Error parsing: http://host/: failed(2,200): java.nio.charset.IllegalCharsetNameException: $charset
 Found a TextHeaderAtom not followed by a TextBytesAtom or TextCharsAtom: Followed by 3999
 ParseSegment: finished at 2011-06-24 01:14:54, elapsed: 00:01:12
 {code}
 Stack trace:
 {code}
 2011-06-24 01:14:23,442 WARN  parse.html - java.nio.charset.IllegalCharsetNameException: $charset
 2011-06-24 01:14:23,442 WARN  parse.html - at java.nio.charset.Charset.checkName(Charset.java:284)
 2011-06-24 01:14:23,442 WARN  parse.html - at java.nio.charset.Charset.lookup2(Charset.java:458)
 2011-06-24 01:14:23,442 WARN  parse.html - at java.nio.charset.Charset.lookup(Charset.java:437)
 2011-06-24 01:14:23,442 WARN  parse.html - at java.nio.charset.Charset.isSupported(Charset.java:479)
 2011-06-24 01:14:23,442 WARN  parse.html - at org.apache.nutch.util.EncodingDetector.resolveEncodingAlias(EncodingDetector.java:310)
 2011-06-24 01:14:23,442 WARN  parse.html - at org.apache.nutch.util.EncodingDetector.addClue(EncodingDetector.java:201)
 2011-06-24 01:14:23,442 WARN  parse.html - at org.apache.nutch.util.EncodingDetector.addClue(EncodingDetector.java:208)
 2011-06-24 01:14:23,442 WARN  parse.html - at org.apache.nutch.util.EncodingDetector.autoDetectClues(EncodingDetector.java:193)
 2011-06-24 01:14:23,442 WARN  parse.html - at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:138)
 2011-06-24 01:14:23,442 WARN  parse.html - at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
 2011-06-24 01:14:23,443 WARN  parse.html - at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
 2011-06-24 01:14:23,443 WARN  parse.html - at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 2011-06-24 01:14:23,443 WARN  parse.html - at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 2011-06-24 01:14:23,443 WARN  parse.html - at java.lang.Thread.run(Thread.java:662)
 2011-06-24 01:14:23,443 WARN  parse.ParseSegment - Error parsing: http://host/: failed(2,200): java.nio.charset.IllegalCharsetNameException: $charset
 {code}
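
The trace shows Charset.isSupported throwing rather than returning false when the name contains an illegal character such as '$'. A minimal Java sketch of one way to guard such a lookup follows. The class SafeCharsetLookup and the method resolveOrNull are hypothetical names, and this is not necessarily what the attached NUTCH-1012-1.4.patch does.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper, not part of Nutch's EncodingDetector.
final class SafeCharsetLookup {

    // Returns the trimmed charset name if the JVM supports it, otherwise null.
    // Syntactically illegal names such as "$charset" also map to null
    // instead of propagating IllegalCharsetNameException.
    static String resolveOrNull(String name) {
        if (name == null) {
            return null; // Charset.isSupported(null) would throw
        }
        String candidate = name.trim();
        try {
            return Charset.isSupported(candidate) ? candidate : null;
        } catch (IllegalCharsetNameException e) {
            return null;
        }
    }
}
```

A caller can then treat null as "no usable clue" and continue with whatever other encoding hints it has.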





[jira] [Updated] (NUTCH-1012) Cannot handle illegal charset $charset

2011-06-27 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1012:
-

Fix Version/s: 2.0

 Cannot handle illegal charset $charset
 --

 Key: NUTCH-1012
 URL: https://issues.apache.org/jira/browse/NUTCH-1012
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.3
Reporter: Markus Jelsma
Priority: Minor
 Fix For: 1.4, 2.0

 Attachments: NUTCH-1012-1.4.patch







[jira] [Assigned] (NUTCH-295) More description for fetcher.threads.fetch property

2011-06-27 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma reassigned NUTCH-295:
---

Assignee: Markus Jelsma  (was: Dennis Kubes)

 More description for fetcher.threads.fetch property
 ---

 Key: NUTCH-295
 URL: https://issues.apache.org/jira/browse/NUTCH-295
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.8
Reporter: Dennis Kubes
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.4, 2.0

 Attachments: fetcher_threads_desc.patch


 Added some description to the fetcher.threads.fetch property to explain the 
 number of threads running in a cluster. Patch is attached.





[jira] [Updated] (NUTCH-1012) Cannot handle illegal charset $charset

2011-06-27 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1012:
-

Patch Info: [Patch Available]

 Cannot handle illegal charset $charset
 --

 Key: NUTCH-1012
 URL: https://issues.apache.org/jira/browse/NUTCH-1012
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.3
Reporter: Markus Jelsma
Priority: Minor
 Fix For: 1.4, 2.0

 Attachments: NUTCH-1012-1.4.patch







[jira] [Closed] (NUTCH-295) More description for fetcher.threads.fetch property

2011-06-27 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-295.
---


Finally, a five-year-old issue resolved ;)

 More description for fetcher.threads.fetch property
 ---

 Key: NUTCH-295
 URL: https://issues.apache.org/jira/browse/NUTCH-295
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.8
Reporter: Dennis Kubes
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.4, 2.0

 Attachments: fetcher_threads_desc.patch


 Added some description to the fetcher.threads.fetch property to explain the 
 number of threads running in a cluster. Patch is attached.





[jira] [Commented] (NUTCH-956) solrindex issues

2011-06-27 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055498#comment-13055498
 ] 

Markus Jelsma commented on NUTCH-956:
-

Alexis, the first two issues are already in Nutch 1.3 and 2.0. Your 
content-type fix is for 2.0. Which NPEs did you get? I haven't done extensive 
testing with 2.0 but don't remember seeing an NPE. And what surprises do you 
avoid with the fourth issue?



 solrindex issues
 

 Key: NUTCH-956
 URL: https://issues.apache.org/jira/browse/NUTCH-956
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 2.0
Reporter: Alexis
 Fix For: 1.4, 2.0

 Attachments: solr.patch


 I ran into a few caveats with the solrindex command while trying to index 
 documents. Please refer to 
 http://techvineyard.blogspot.com/2010/12/build-nutch-20.html#solrindex, which 
 describes my tests.





[jira] [Issue Comment Edited] (NUTCH-961) Expose Tika's boilerpipe support

2011-06-27 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055530#comment-13055530
 ] 

Markus Jelsma edited comment on NUTCH-961 at 6/27/11 1:16 PM:
--

Patch to include markup from Tika. Anchors are now detected, but fewer outlinks 
are found! Does anyone have a good suggestion on where to fetch our outlinks 
with the anchors from?

  was (Author: markus17):
Patch to include mark up from Tika. Anchors are now detected but less 
outlinks are found!
  
 Expose Tika's boilerpipe support
 

 Key: NUTCH-961
 URL: https://issues.apache.org/jira/browse/NUTCH-961
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Markus Jelsma
 Fix For: 1.4, 2.0

 Attachments: BoilerpipeExtractorRepository.java, 
 NUTCH-961-1.3-3.patch, NUTCH-961-1.3-tikaparser.patch, 
 NUTCH-961-1.3-tikaparser1.patch, NUTCH-961v2.patch


 Tika 0.8 comes with the Boilerpipe content handler, which can be used to 
 extract boilerplate content from HTML pages. We should see how we can expose 
 Boilerpipe in the Nutch configuration.





[jira] [Updated] (NUTCH-1016) Strip UTF-8 non-character codepoints

2011-06-27 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1016:
-

Attachment: NUTCH-1016-1.4.patch

Patch for 1.4.

 Strip UTF-8 non-character codepoints
 

 Key: NUTCH-1016
 URL: https://issues.apache.org/jira/browse/NUTCH-1016
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.3
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.4, 2.0

 Attachments: NUTCH-1016-1.4.patch


 During a very large crawl I found a few documents producing non-character 
 codepoints. When indexing to Solr, this yields the following exception:
 {code}
 SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0x at char #1142033, byte #1155068)
 at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
 at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
 {code}
 Quite annoying! Here's a quick fix for SolrWriter that passes the value of the 
 content field to a method that strips away non-characters. I'm not too sure 
 about this implementation, but the tests I've run locally with a huge dataset 
 now pass. Here's a list of codepoints to strip away: 
 http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:]
 Please comment!
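
For readers without the patch at hand, here is a rough Java sketch of the general idea, assuming the standard Unicode definition of noncharacters (U+FDD0 through U+FDEF, plus the last two code points of every plane). The class NonCharacterStripper and its methods are hypothetical names; this is not the code from the attached patch.

```java
// Hypothetical helper, not the SolrWriter change from the patch.
final class NonCharacterStripper {

    // True for Unicode noncharacters: U+FDD0..U+FDEF and any code point
    // whose low 16 bits are FFFE or FFFF (U+FFFE, U+FFFF, U+1FFFE, ...).
    static boolean isNonCharacter(int codePoint) {
        return (codePoint >= 0xFDD0 && codePoint <= 0xFDEF)
            || (codePoint & 0xFFFE) == 0xFFFE;
    }

    // Returns a copy of the input with all noncharacters removed.
    // Iterates by code point so supplementary characters stay intact;
    // unpaired surrogates are passed through unchanged.
    static String strip(String input) {
        if (input == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int codePoint = input.codePointAt(i);
            i += Character.charCount(codePoint);
            if (!isNonCharacter(codePoint)) {
                out.appendCodePoint(codePoint);
            }
        }
        return out.toString();
    }
}
```

Applying something like strip() to the content field before the document is handed to Solr should avoid the exception above, at the cost of one extra pass over each document.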





[jira] [Created] (NUTCH-1016) Strip UTF-8 non-character codepoints

2011-06-27 Thread Markus Jelsma (JIRA)
Strip UTF-8 non-character codepoints


 Key: NUTCH-1016
 URL: https://issues.apache.org/jira/browse/NUTCH-1016
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.3
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.4, 2.0
 Attachments: NUTCH-1016-1.4.patch

During a very large crawl I found a few documents producing non-character 
codepoints. When indexing to Solr, this yields the following exception:

{code}
SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0x at char #1142033, byte #1155068)
at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
{code}

Quite annoying! Here's a quick fix for SolrWriter that passes the value of the 
content field to a method that strips away non-characters. I'm not too sure 
about this implementation, but the tests I've run locally with a huge dataset 
now pass. Here's a list of codepoints to strip away: 
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:]

Please comment!





[jira] [Updated] (NUTCH-1016) Strip UTF-8 non-character codepoints

2011-06-27 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1016:
-

Attachment: (was: NUTCH-1016-1.4.patch)

 Strip UTF-8 non-character codepoints
 

 Key: NUTCH-1016
 URL: https://issues.apache.org/jira/browse/NUTCH-1016
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.3
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.4, 2.0







[jira] [Updated] (NUTCH-1016) Strip UTF-8 non-character codepoints

2011-06-27 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1016:
-

Attachment: NUTCH-1016-1.4-2.patch

Silly me again, the patch was wrong. Changed ORs to ANDs!

This patch also includes more verbose output from the SolrWriter class. Handy 
for batches of many thousands of documents. This patch doesn't include a change 
to log4j.properties though.

Should I get rid of the logging? Keep it?

 Strip UTF-8 non-character codepoints
 

 Key: NUTCH-1016
 URL: https://issues.apache.org/jira/browse/NUTCH-1016
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.3
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.4, 2.0

 Attachments: NUTCH-1016-1.4-2.patch







[jira] [Created] (NUTCH-1017) Exception getting mime type by name

2011-06-27 Thread Markus Jelsma (JIRA)
Exception getting mime type by name
---

 Key: NUTCH-1017
 URL: https://issues.apache.org/jira/browse/NUTCH-1017
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.4, 2.0


Large crawls of `bad` websites tend to produce a lot of parsing errors. One of 
them seems to be related to retrieving MIME types:

{code}
WARNING: Exception getting mime type by name: [WEBSITE_CONTENT]: Message: Invalid media type name: WEBSITE_CONTENT
Jun 27, 2011 9:23:27 PM org.apache.nutch.util.MimeUtil forName
WARNING: Exception getting mime type by name: [WEBSITE_CONTENT]: Message: Invalid media type name: WEBSITE_CONTENT
Jun 27, 2011 9:23:27 PM org.apache.nutch.util.MimeUtil forName
WARNING: Exception getting mime type by name: [Mime-Type]: Message: Invalid media type name: Mime-Type
Jun 27, 2011 9:23:27 PM org.apache.nutch.util.MimeUtil forName
WARNING: Exception getting mime type by name: [WEBSITE_CONTENT]: Message: Invalid media type name: WEBSITE_CONTENT
Jun 27, 2011 9:23:27 PM org.apache.nutch.util.MimeUtil forName
WARNING: Exception getting mime type by name: [text/html charset=utf-8]: Message: Invalid media type name: text/html charset=utf-8
{code}
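
The values in the log fall into two groups: strings that are not media types at all (WEBSITE_CONTENT, Mime-Type) and a real type with a malformed parameter (text/html charset=utf-8, missing the semicolon). A small Java sketch of one way to pre-filter such values before a by-name lookup follows. The class MediaTypeCleaner, the method cleanOrNull, and the simplified token pattern are assumptions for illustration, not part of MimeUtil.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-filter, not part of Nutch's MimeUtil.
final class MediaTypeCleaner {

    // Simplified "type/subtype" token pattern (an assumption; the RFC token
    // grammar allows a few more characters). Anything after the subtype,
    // such as "; charset=..." or " charset=...", is ignored.
    private static final Pattern TYPE = Pattern.compile(
        "^\\s*([A-Za-z0-9!#$&^_.+-]+/[A-Za-z0-9!#$&^_.+-]+)");

    // Returns the lower-cased "type/subtype" prefix, or null when the value
    // does not start with something shaped like a media type.
    static String cleanOrNull(String raw) {
        if (raw == null) {
            return null;
        }
        Matcher matcher = TYPE.matcher(raw);
        return matcher.find() ? matcher.group(1).toLowerCase(Locale.ROOT) : null;
    }
}
```

With a filter like this, values such as WEBSITE_CONTENT would be skipped quietly instead of producing a warning per document, while text/html charset=utf-8 would still resolve to text/html.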







[jira] [Commented] (NUTCH-1018) Solr Document Size Limit

2011-06-27 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055731#comment-13055731
 ] 

Markus Jelsma commented on NUTCH-1018:
--

This might be useful, though perhaps as an indexing plugin rather than as a 
Solr option. That way other future back ends such as ES would also benefit. 

However, in Solr you can copyField a source to a destination field and specify 
how many chars are to be copied over. This yields the same result.
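
If this were done on the Nutch side as an indexing plugin instead, the core of it could be as small as the following Java sketch. The class ContentLimiter and the method limit are hypothetical names, and the limit is counted in chars (as with the copyField approach), not in bytes.

```java
// Hypothetical helper for an indexing plugin; not an existing Nutch class.
final class ContentLimiter {

    // Truncates content to at most maxChars UTF-16 chars. A negative limit
    // means "no limit". The cut is moved back by one char when it would
    // otherwise split a surrogate pair.
    static String limit(String content, int maxChars) {
        if (content == null || maxChars < 0 || content.length() <= maxChars) {
            return content;
        }
        int end = maxChars;
        if (end > 0
                && Character.isHighSurrogate(content.charAt(end - 1))
                && Character.isLowSurrogate(content.charAt(end))) {
            end--;
        }
        return content.substring(0, end);
    }
}
```

Because the truncation happens after parsing, the fetch-side limit could stay high enough for PDFs to parse completely while only the first part of the text is indexed.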

 Solr Document Size Limit
 

 Key: NUTCH-1018
 URL: https://issues.apache.org/jira/browse/NUTCH-1018
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Mark Achee
Priority: Minor
  Labels: solr

 There should be an option, perhaps named solr.content.limit, that defines the 
 max size of documents added to Solr.  I've had issues with large documents in 
 Solr, so I set the file.content.limit to 2MB.  However, this causes many 
 files to not be parsed (mostly PDFs) because of only retrieving parts of the 
 document.  With this new option, I could still correctly parse them, but only 
 index the first 2MB (or however large it is set) in Solr.





[Nutch Wiki] Trivial Update of bin/nutch_crawl by LewisJohnMcgibbney

2011-06-27 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The bin/nutch_crawl page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/bin/nutch_crawl?action=diff&rev1=11&rev2=12

Comment:
Formatting for easier reading

  
  Usage: 
  {{{
- bin/nutch org.apache.nutch.crawl.!Crawl (-local | -ndfs nameserver:port) 
dir_with_url_files [-threads n] [-depth i] [-showThreadID]
+ bin/nutch org.apache.nutch.crawl.Crawl (-local | -ndfs nameserver:port) 
dir_with_url_files [-threads n] [-depth i] [-showThreadID]
  }}}
  
- dir_with_url_files: Contains text files with URL lists. This must be an 
existing directory. Example would be ${NUTCH_HOME}/urls
+ '''dir_with_url_files''': Contains text files with URL lists. This must be 
an existing directory. Example would be ${NUTCH_HOME}/urls
  
- [-threads n]: This parameter enables you to choose how many threads Nutch 
should use when crawling.
+ '''[-threads n]''': This parameter enables you to choose how many threads 
Nutch should use when crawling.
  
- [-depth i]: You can tell Nutch how deep it should crawl. If you don’t tell 
Nutch a value, it takes 5 as his standard parameter. 
+ '''[-depth i]''': You can tell Nutch how deep it should crawl. If you don’t 
tell Nutch a value, it takes 5 as his standard parameter. 
  For example if you pass –depth 1 as the parameter, Nutch will only index the 
first level. If you say –depth 2 (or more) Nutch will follow this number of 
outlinks.
  
- [-showThreadID]: 
+ '''[-showThreadID]''': 
  
- -local
+ '''-local''':
  
- -ndfs nameserver:port
+ '''-ndfs nameserver:port''':
  
  
  CommandLineOptions


[jira] [Updated] (NUTCH-1019) Edit comment in org.apache.nutch.crawl.Crawl to reflect removal of legacy

2011-06-27 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1019:


Summary: Edit comment in org.apache.nutch.crawl.Crawl to reflect removal of 
legacy  (was: Edit comment in org.apache.nutc.crawl.Crawl to reflect removal of 
legacy)

 Edit comment in org.apache.nutch.crawl.Crawl to reflect removal of legacy
 -

 Key: NUTCH-1019
 URL: https://issues.apache.org/jira/browse/NUTCH-1019
 Project: Nutch
  Issue Type: Improvement
  Components: documentation
Affects Versions: 1.4, 2.0
Reporter: Lewis John McGibbney
Priority: Trivial
 Fix For: 1.4, 2.0


 When updating the wiki documentation for command line options, I noticed that 
 the comment on line 51 of the above class is inaccurate and needs to be 
 updated to reflect changes. Although this is a trivial task, I won't be able 
 to commit until the second week of July. Could someone else pick this up, please?





[jira] [Created] (NUTCH-1019) Edit comment in org.apache.nutc.crawl.Crawl to reflect removal of legacy

2011-06-27 Thread Lewis John McGibbney (JIRA)
Edit comment in org.apache.nutc.crawl.Crawl to reflect removal of legacy


 Key: NUTCH-1019
 URL: https://issues.apache.org/jira/browse/NUTCH-1019
 Project: Nutch
  Issue Type: Improvement
  Components: documentation
Affects Versions: 1.4, 2.0
Reporter: Lewis John McGibbney
Priority: Trivial
 Fix For: 1.4, 2.0


When updating the wiki documentation for command line options, I noticed that 
the comment on line 51 of the above class is inaccurate and needs to be updated 
to reflect changes. Although this is a trivial task, I won't be able to commit 
until the second week of July. Could someone else pick this up, please?





[Nutch Wiki] Trivial Update of bin/nutch_crawl by LewisJohnMcgibbney

2011-06-27 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The bin/nutch_crawl page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/bin/nutch_crawl?action=diff&rev1=13&rev2=14

  
  Usage: 
  {{{
- bin/nutch org.apache.nutch.crawl.Crawl (-local | -ndfs nameserver:port) 
dir_with_url_files [-threads n] [-depth i] [-showThreadID]
+ bin/nutch org.apache.nutch.crawl.Crawl (-local | -ndfs nameserver:port) 
dir_with_url_files [-threads n] [-depth i] [-showThreadID] [-solrindex s]
  }}}
  
  '''dir_with_url_files''': Contains text files with URL lists. This must be 
an existing directory. Example would be ${NUTCH_HOME}/urls
@@ -16, +16 @@

  '''[-depth i]''': You can tell Nutch how deep it should crawl. If you don’t 
tell Nutch a value, it takes 5 as his standard parameter. 
  For example if you pass –depth 1 as the parameter, Nutch will only index the 
first level. If you say –depth 2 (or more) Nutch will follow this number of 
outlinks.
  
- '''[-solrindex p]''': Enables us to pass our Solr instance as an indexing 
parameter to simplify the process of indexing with Solr.
+ '''[-solrindex s]''': Enables us to pass our Solr instance as an indexing 
parameter to simplify the process of indexing with Solr.
  
  '''[-showThreadID]''':