[jira] [Created] (NUTCH-1015) can't parse erroneous date: 2006-05-24T20:03:42
can't parse erroneous date: 2006-05-24T20:03:42
    Key: NUTCH-1015
    URL: https://issues.apache.org/jira/browse/NUTCH-1015
    Project: Nutch
    Issue Type: Bug
    Components: indexer
    Reporter: Markus Jelsma
    Fix For: 1.4, 2.0

--
This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1015) MoreIndexingFilter: can't parse erroneous date: 2006-05-24T20:03:42
[ https://issues.apache.org/jira/browse/NUTCH-1015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1015:
    Fix Version/s: (was: 1.4) (was: 2.0)

MoreIndexingFilter: can't parse erroneous date: 2006-05-24T20:03:42
    Key: NUTCH-1015
    URL: https://issues.apache.org/jira/browse/NUTCH-1015
    Project: Nutch
    Issue Type: Bug
    Components: indexer
    Reporter: Markus Jelsma

MoreIndexingFilter must handle the following url's gracefully:
{code}
can't parse erroneous date: Sun, 27 Jun 2010 06:51:35 GMT+1
can't parse erroneous date: ma, 27 jun 2011 05:15:32 GMT
can't parse erroneous date: Mon, 23 May 2011 22:05:58 GMT
can't parse erroneous date: GMT
{code}
What to do? Default to now? Fetch time? Anything?
[jira] [Updated] (NUTCH-1015) MoreIndexingFilter: can't parse erroneous date: 2006-05-24T20:03:42
[ https://issues.apache.org/jira/browse/NUTCH-1015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1015:
    Description:
MoreIndexingFilter must handle the following url's gracefully:
{code}
can't parse erroneous date: Sun, 27 Jun 2010 06:51:35 GMT+1
can't parse erroneous date: ma, 27 jun 2011 05:15:32 GMT
can't parse erroneous date: Mon, 23 May 2011 22:05:58 GMT
can't parse erroneous date: GMT
{code}

    was:
MoreIndexingFilter must handle the following url's gracefully:
{code}
can't parse erroneous date: Sun, 27 Jun 2010 06:51:35 GMT+1
can't parse erroneous date: ma, 27 jun 2011 05:15:32 GMT
can't parse erroneous date: Mon, 23 May 2011 22:05:58 GMT
can't parse erroneous date: GMT
{code}
What to do? Default to now? Fetch time? Anything?
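One possible answer to the "default to now or fetch time?" question raised in NUTCH-1015, sketched in Java. This is only an illustration, not the MoreIndexingFilter code: the class name, the method name and the two date patterns are assumptions, and the caller decides what the fallback value is (for example the fetch time). Values that match neither pattern, such as a bare "GMT", fall through to the fallback instead of raising an error.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper, not part of Nutch.
final class LenientDateSketch {

    // Assumed pattern list: an RFC 1123 style date and a bare ISO-like timestamp.
    private static final String[] PATTERNS = {
        "EEE, dd MMM yyyy HH:mm:ss zzz",
        "yyyy-MM-dd'T'HH:mm:ss"
    };

    /** Returns the parsed time in milliseconds, or fallbackMillis if no pattern matches. */
    static long parseOrFallback(String raw, long fallbackMillis) {
        if (raw == null) {
            return fallbackMillis;
        }
        String value = raw.trim();
        for (String pattern : PATTERNS) {
            // SimpleDateFormat is not thread-safe, so create one per call.
            SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.US);
            format.setTimeZone(TimeZone.getTimeZone("UTC"));
            format.setLenient(false);
            ParsePosition pos = new ParsePosition(0);
            Date parsed = format.parse(value, pos); // null on failure, no exception
            // Accept only if the whole value was consumed.
            if (parsed != null && pos.getIndex() == value.length()) {
                return parsed.getTime();
            }
        }
        return fallbackMillis;
    }
}
```

A caller could pass the fetch time as `fallbackMillis`; whether that or "now" is the better default is exactly the open question in the issue.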
[jira] [Commented] (NUTCH-1012) Cannot handle illegal charset $charset
[ https://issues.apache.org/jira/browse/NUTCH-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055479#comment-13055479 ]

Markus Jelsma commented on NUTCH-1012:

Objections? I'd like to send this one in.

Cannot handle illegal charset $charset
    Key: NUTCH-1012
    URL: https://issues.apache.org/jira/browse/NUTCH-1012
    Project: Nutch
    Issue Type: Bug
    Components: parser
    Affects Versions: 1.3
    Reporter: Markus Jelsma
    Priority: Minor
    Fix For: 1.4, 2.0
    Attachments: NUTCH-1012-1.4.patch

Pages returning:
{code}
Content-Type: text/html; charset=$charset
{code}
cause:
{code}
Error parsing: http://host/: failed(2,200): java.nio.charset.IllegalCharsetNameException: $charset
Found a TextHeaderAtom not followed by a TextBytesAtom or TextCharsAtom: Followed by 3999
ParseSegment: finished at 2011-06-24 01:14:54, elapsed: 00:01:12
{code}
Stack trace:
{code}
2011-06-24 01:14:23,442 WARN parse.html - java.nio.charset.IllegalCharsetNameException: $charset
2011-06-24 01:14:23,442 WARN parse.html - at java.nio.charset.Charset.checkName(Charset.java:284)
2011-06-24 01:14:23,442 WARN parse.html - at java.nio.charset.Charset.lookup2(Charset.java:458)
2011-06-24 01:14:23,442 WARN parse.html - at java.nio.charset.Charset.lookup(Charset.java:437)
2011-06-24 01:14:23,442 WARN parse.html - at java.nio.charset.Charset.isSupported(Charset.java:479)
2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.util.EncodingDetector.resolveEncodingAlias(EncodingDetector.java:310)
2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.util.EncodingDetector.addClue(EncodingDetector.java:201)
2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.util.EncodingDetector.addClue(EncodingDetector.java:208)
2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.util.EncodingDetector.autoDetectClues(EncodingDetector.java:193)
2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:138)
2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
2011-06-24 01:14:23,443 WARN parse.html - at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
2011-06-24 01:14:23,443 WARN parse.html - at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
2011-06-24 01:14:23,443 WARN parse.html - at java.util.concurrent.FutureTask.run(FutureTask.java:138)
2011-06-24 01:14:23,443 WARN parse.html - at java.lang.Thread.run(Thread.java:662)
2011-06-24 01:14:23,443 WARN parse.ParseSegment - Error parsing: http://host/: failed(2,200): java.nio.charset.IllegalCharsetNameException: $charset
{code}
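The trace above shows the exception escaping from `Charset.isSupported`, which throws `IllegalCharsetNameException` for names containing illegal characters such as `$`. A minimal Java sketch of a guard follows; it is not the attached NUTCH-1012-1.4.patch, and the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper, not part of Nutch.
final class CharsetGuardSketch {

    /** Returns true only if the name is a legal charset name that this JVM supports. */
    static boolean isUsableCharset(String name) {
        if (name == null || name.isEmpty()) {
            return false;
        }
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            // e.g. "$charset": treat an illegal name like an unknown charset.
            return false;
        }
    }
}
```

With a check like this, a page announcing `charset=$charset` would presumably just be treated as having no usable charset clue, rather than failing the whole parse.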
[jira] [Updated] (NUTCH-1012) Cannot handle illegal charset $charset
[ https://issues.apache.org/jira/browse/NUTCH-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1012:
    Fix Version/s: 2.0
[jira] [Assigned] (NUTCH-295) More description for fetcher.threads.fetch property
[ https://issues.apache.org/jira/browse/NUTCH-295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma reassigned NUTCH-295:
    Assignee: Markus Jelsma (was: Dennis Kubes)

More description for fetcher.threads.fetch property
    Key: NUTCH-295
    URL: https://issues.apache.org/jira/browse/NUTCH-295
    Project: Nutch
    Issue Type: Improvement
    Components: fetcher
    Affects Versions: 0.8
    Reporter: Dennis Kubes
    Assignee: Markus Jelsma
    Priority: Minor
    Fix For: 1.4, 2.0
    Attachments: fetcher_threads_desc.patch

Added some description to the fetcher.threads.fetch property to explain the number of threads running in a cluster. Patch is attached.
[jira] [Updated] (NUTCH-1012) Cannot handle illegal charset $charset
[ https://issues.apache.org/jira/browse/NUTCH-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1012:
    Patch Info: [Patch Available]
[jira] [Closed] (NUTCH-295) More description for fetcher.threads.fetch property
[ https://issues.apache.org/jira/browse/NUTCH-295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-295.

Finally, a five year old issue resolved ;)
[jira] [Commented] (NUTCH-956) solrindex issues
[ https://issues.apache.org/jira/browse/NUTCH-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055498#comment-13055498 ]

Markus Jelsma commented on NUTCH-956:

Alexis, the first two issues are already in Nutch 1.3 and 2.0. Your content-type fix is for 2.0. What NPEs did you get? I haven't done extensive testing with 2.0 but don't remember seeing an NPE. And what surprises do you avoid with the fourth issue?

solrindex issues
    Key: NUTCH-956
    URL: https://issues.apache.org/jira/browse/NUTCH-956
    Project: Nutch
    Issue Type: Bug
    Components: indexer
    Affects Versions: 2.0
    Reporter: Alexis
    Fix For: 1.4, 2.0
    Attachments: solr.patch

I ran into a few caveats with the solrindex command when trying to index documents. Please refer to http://techvineyard.blogspot.com/2010/12/build-nutch-20.html#solrindex, which describes my tests.
[jira] [Issue Comment Edited] (NUTCH-961) Expose Tika's boilerpipe support
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055530#comment-13055530 ]

Markus Jelsma edited comment on NUTCH-961 at 6/27/11 1:16 PM:

Patch to include markup from Tika. Anchors are now detected but fewer outlinks are found! Does anyone have a good suggestion on where to fetch our outlinks with the anchors from?

was (Author: markus17):
Patch to include markup from Tika. Anchors are now detected but fewer outlinks are found!

Expose Tika's boilerpipe support
    Key: NUTCH-961
    URL: https://issues.apache.org/jira/browse/NUTCH-961
    Project: Nutch
    Issue Type: New Feature
    Components: parser
    Reporter: Markus Jelsma
    Fix For: 1.4, 2.0
    Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.3-3.patch, NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, NUTCH-961v2.patch

Tika 0.8 comes with the Boilerpipe content handler, which can be used to extract boilerplate content from HTML pages. We should see how we can expose Boilerpipe in the Nutch configuration.
[jira] [Updated] (NUTCH-1016) Strip UTF-8 non-character codepoints
[ https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1016:
    Attachment: NUTCH-1016-1.4.patch

Patch for 1.4.
[jira] [Created] (NUTCH-1016) Strip UTF-8 non-character codepoints
Strip UTF-8 non-character codepoints
    Key: NUTCH-1016
    URL: https://issues.apache.org/jira/browse/NUTCH-1016
    Project: Nutch
    Issue Type: Bug
    Components: indexer
    Affects Versions: 1.3
    Reporter: Markus Jelsma
    Assignee: Markus Jelsma
    Fix For: 1.4, 2.0
    Attachments: NUTCH-1016-1.4.patch

During a very large crawl I found a few documents producing non-character codepoints. When indexing to Solr this yields the following exception:
{code}
SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0x at char #1142033, byte #1155068)
at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
{code}
Quite annoying! Here's a quick fix for SolrWriter that passes the value of the content field to a method that strips away non-characters. I'm not too sure about this implementation, but the tests I've done locally with a huge dataset now pass correctly.

Here's a list of codepoints to strip away: http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:]

Please comment!
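For readers without the attachment, here is a rough Java sketch of what stripping non-characters can look like. It is not the code from NUTCH-1016-1.4.patch; the class and method names are invented for illustration. It relies on the Unicode definition of noncharacters: U+FDD0 through U+FDEF, plus the last two code points of every plane (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, and so on up to U+10FFFF).

```java
// Hypothetical helper, not the method added to SolrWriter by the patch.
final class NonCharStripperSketch {

    /** True for the 66 Unicode noncharacter code points. */
    static boolean isNonCharacter(int codePoint) {
        return (codePoint >= 0xFDD0 && codePoint <= 0xFDEF)
            // The low 16 bits are FFFE or FFFF in any plane.
            || (codePoint & 0xFFFE) == 0xFFFE;
    }

    /** Returns the input with all noncharacter code points removed. */
    static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int codePoint = input.codePointAt(i);
            i += Character.charCount(codePoint);
            if (!isNonCharacter(codePoint)) {
                out.appendCodePoint(codePoint);
            }
        }
        return out.toString();
    }
}
```

The loop walks code points rather than `char` values so that supplementary-plane noncharacters are caught as well. Unpaired surrogates are left untouched here, which may or may not be enough for the XML parser on the Solr side.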
[jira] [Updated] (NUTCH-1016) Strip UTF-8 non-character codepoints
[ https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1016:
    Attachment: (was: NUTCH-1016-1.4.patch)
[jira] [Updated] (NUTCH-1016) Strip UTF-8 non-character codepoints
[ https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1016:
    Attachment: NUTCH-1016-1.4-2.patch

Silly me again, the patch was wrong. Changed ORs to ANDs! This patch also includes more verbose output from the SolrWriter class, which is handy for batches of many thousands of documents. It doesn't include a change to log4j.properties, though. Should I get rid of the logging? Keep it?
[jira] [Created] (NUTCH-1017) Exception getting mime type by name
Exception getting mime type by name
    Key: NUTCH-1017
    URL: https://issues.apache.org/jira/browse/NUTCH-1017
    Project: Nutch
    Issue Type: Bug
    Affects Versions: 1.4
    Reporter: Markus Jelsma
    Assignee: Markus Jelsma
    Fix For: 1.4, 2.0

Large crawls of `bad` websites tend to produce a lot of parsing errors. One of them is related to retrieving mime types, so it seems:
{code}
WARNING: Exception getting mime type by name: [WEBSITE_CONTENT]: Message: Invalid media type name: WEBSITE_CONTENT
Jun 27, 2011 9:23:27 PM org.apache.nutch.util.MimeUtil forName
WARNING: Exception getting mime type by name: [WEBSITE_CONTENT]: Message: Invalid media type name: WEBSITE_CONTENT
Jun 27, 2011 9:23:27 PM org.apache.nutch.util.MimeUtil forName
WARNING: Exception getting mime type by name: [Mime-Type]: Message: Invalid media type name: Mime-Type
Jun 27, 2011 9:23:27 PM org.apache.nutch.util.MimeUtil forName
WARNING: Exception getting mime type by name: [WEBSITE_CONTENT]: Message: Invalid media type name: WEBSITE_CONTENT
Jun 27, 2011 9:23:27 PM org.apache.nutch.util.MimeUtil forName
WARNING: Exception getting mime type by name: [text/html charset=utf-8]: Message: Invalid media type name: text/html charset=utf-8
{code}
[jira] [Commented] (NUTCH-1018) Solr Document Size Limit
[ https://issues.apache.org/jira/browse/NUTCH-1018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055731#comment-13055731 ]

Markus Jelsma commented on NUTCH-1018:

This might be useful but maybe not as a Solr option but as an indexing plugin. This way other future back ends such as ES would also benefit. However, in Solr you can copyField a source to a destination field and specify how many chars are to be copied over. This yields the same result.

Solr Document Size Limit
    Key: NUTCH-1018
    URL: https://issues.apache.org/jira/browse/NUTCH-1018
    Project: Nutch
    Issue Type: New Feature
    Components: indexer
    Reporter: Mark Achee
    Priority: Minor
    Labels: solr

There should be an option, perhaps named solr.content.limit, that defines the max size of documents added to Solr. I've had issues with large documents in Solr, so I set the file.content.limit to 2MB. However, this causes many files to not be parsed (mostly PDFs) because of only retrieving parts of the document. With this new option, I could still correctly parse them, but only index the first 2MB (or however large it is set) in Solr.
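To make the "indexing plugin" idea a bit more concrete, here is a small Java sketch of the truncation step such a plugin might apply to the content field before it reaches any back end. Everything in it is hypothetical: no such plugin exists in this thread, the names are invented, and the limit is counted in UTF-16 chars rather than the bytes that file.content.limit uses.

```java
// Hypothetical helper illustrating the proposed content limit; not Nutch code.
final class ContentLimitSketch {

    /** Returns content cut to at most maxChars chars, never splitting a surrogate pair. */
    static String limit(String content, int maxChars) {
        if (content == null || maxChars < 0 || content.length() <= maxChars) {
            return content; // nothing to cut, or limit disabled via a negative value
        }
        int end = maxChars;
        // If the cut would fall between a high and a low surrogate, back up by one.
        if (end > 0
                && Character.isHighSurrogate(content.charAt(end - 1))
                && Character.isLowSurrogate(content.charAt(end))) {
            end--;
        }
        return content.substring(0, end);
    }
}
```

Parsing would still see the full document, which is the point of the request; only the indexed text is shortened.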
[Nutch Wiki] Trivial Update of bin/nutch_crawl by LewisJohnMcgibbney
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.

The bin/nutch_crawl page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/bin/nutch_crawl?action=diff&rev1=11&rev2=12

Comment:
Formatting for easier reading

  Usage:
  {{{
- bin/nutch org.apache.nutch.crawl.!Crawl (-local | -ndfs nameserver:port) dir_with_url_files [-threads n] [-depth i] [-showThreadID]
+ bin/nutch org.apache.nutch.crawl.Crawl (-local | -ndfs nameserver:port) dir_with_url_files [-threads n] [-depth i] [-showThreadID]
  }}}
- dir_with_url_files: Contains text files with URL lists. This must be an existing directory. Example would be ${NUTCH_HOME}/urls
+ '''dir_with_url_files''': Contains text files with URL lists. This must be an existing directory. Example would be ${NUTCH_HOME}/urls
- [-threads n]: This parameter enables you to choose how many threads Nutch should use when crawling.
+ '''[-threads n]''': This parameter enables you to choose how many threads Nutch should use when crawling.
- [-depth i]: You can tell Nutch how deep it should crawl. If you don’t tell Nutch a value, it takes 5 as his standard parameter.
+ '''[-depth i]''': You can tell Nutch how deep it should crawl. If you don’t tell Nutch a value, it takes 5 as his standard parameter.
  For example if you pass –depth 1 as the parameter, Nutch will only index the first level. If you say –depth 2 (or more) Nutch will follow this number of outlinks.
- [-showThreadID]:
+ '''[-showThreadID]''':
- -local
+ '''-local''':
- -ndfs nameserver:port
+ '''-ndfs nameserver:port''':
  CommandLineOptions
[jira] [Updated] (NUTCH-1019) Edit comment in org.apache.nutch.crawl.Crawl to reflect removal of legacy
[ https://issues.apache.org/jira/browse/NUTCH-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1019:
    Summary: Edit comment in org.apache.nutch.crawl.Crawl to reflect removal of legacy (was: Edit comment in org.apache.nutc.crawl.Crawl to reflect removal of legacy)
[jira] [Created] (NUTCH-1019) Edit comment in org.apache.nutc.crawl.Crawl to reflect removal of legacy
Edit comment in org.apache.nutc.crawl.Crawl to reflect removal of legacy
    Key: NUTCH-1019
    URL: https://issues.apache.org/jira/browse/NUTCH-1019
    Project: Nutch
    Issue Type: Improvement
    Components: documentation
    Affects Versions: 1.4, 2.0
    Reporter: Lewis John McGibbney
    Priority: Trivial
    Fix For: 1.4, 2.0

When updating the wiki documentation for command line options, I noticed that the comment on line 51 of the above class is inaccurate and needs to be updated to reflect changes. Although this is a trivial task, I won't be able to commit until the 2nd week of July. Can I ask someone else to do it, please?
[Nutch Wiki] Trivial Update of bin/nutch_crawl by LewisJohnMcgibbney
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.

The bin/nutch_crawl page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/bin/nutch_crawl?action=diff&rev1=13&rev2=14

  Usage:
  {{{
- bin/nutch org.apache.nutch.crawl.Crawl (-local | -ndfs nameserver:port) dir_with_url_files [-threads n] [-depth i] [-showThreadID]
+ bin/nutch org.apache.nutch.crawl.Crawl (-local | -ndfs nameserver:port) dir_with_url_files [-threads n] [-depth i] [-showThreadID] [-solrindex s]
  }}}
  '''dir_with_url_files''': Contains text files with URL lists. This must be an existing directory. Example would be ${NUTCH_HOME}/urls
@@ -16, +16 @@
  '''[-depth i]''': You can tell Nutch how deep it should crawl. If you don’t tell Nutch a value, it takes 5 as his standard parameter.
  For example if you pass –depth 1 as the parameter, Nutch will only index the first level. If you say –depth 2 (or more) Nutch will follow this number of outlinks.
- '''[-solrindex p]''': Enables us to pass our Solr instance as an indexing parameter to simplify the process of indexing with Solr.
+ '''[-solrindex s]''': Enables us to pass our Solr instance as an indexing parameter to simplify the process of indexing with Solr.
  '''[-showThreadID]''':