parsing a simple text node
Hi there,

I am working on a plugin to fetch structured information (e.g., product price) from web pages, and I have a problem parsing the following simple node:

<span class="product-price-amount">$27.00</span>

The parser first returns the Node for the span, which has a single child, a text Node. I would expect this text Node to have the value $27.00, but when I call getNodeValue() the return value is empty. I also cast the child to a Text node and called getWholeText(), but still got an empty return value.

Does anyone know what's going on? The text $27.00 seems to be missing from the whole hierarchy.

Jun
Re: parsing a simple text node
Hi Jun,

Could it be that the price is set by JavaScript at the moment the page is displayed in your browser? In that case the price actually lives in some data source (XML) or a separate .js file. This is sometimes done when pages need to be displayed in several browsers, such as the iPhone's and regular desktop browsers.

Did you try using an XPath expression? In your case it would be //span[@class='product-price-amount']. There are some good Firefox add-ons for testing XPaths on HTML. I use XPather.

Regards,
Evert

From: Jun Yang jun...@gmail.com
To: dev@nutch.apache.org
Sent: Tuesday, 8 February 2011 09:16:50
Subject: parsing a simple text node
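A minimal sketch of the XPath suggestion above, using only the standard JDK DOM and XPath APIs. The class name PriceXPathSketch and the method extractPrice are hypothetical, and the input is assumed to be well-formed markup that already contains the price; this is not how Nutch's own HTML parser builds its DOM. Some HTML parsers report element names in upper case, in which case the expression would need to match SPAN instead.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only.
public class PriceXPathSketch {

    // Returns the trimmed text of the first span with class
    // "product-price-amount", or null if no such span exists.
    static String extractPrice(String xhtml) throws Exception {
        // Assumption: the markup is well-formed, so the stock XML parser accepts it.
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        Node span = (Node) xpath.evaluate(
                "//span[@class='product-price-amount']", doc, XPathConstants.NODE);
        if (span == null) {
            return null;
        }
        // getTextContent() concatenates the text of all descendant text nodes,
        // so it does not depend on the price being in the first child.
        return span.getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<span class=\"product-price-amount\">$27.00</span>"
                + "</body></html>";
        System.out.println(extractPrice(page));
    }
}
```

If this returns an empty string for a fetched page while the browser shows a price, that would be consistent with the price being filled in by JavaScript after the page loads.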
[jira] Updated: (NUTCH-965) Parsing takes up 100% CPU
[ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexis updated NUTCH-965:
Attachment: parserJob.patch

In the parser mapper, compare the Content-Length header to the size of the content buffer to see if they match. If this HTTP header is available and the file was truncated, skip the parsing step so that the parser does not get stuck in an infinite loop taking up all the CPU resources.

Before, in the logs, we would see:
{noformat}
2011-02-07 14:03:34,693 WARN parse.ParseUtil - TIMEOUT parsing http://downtownjoes.com/botb1.flv with org.apache.nutch.parse.tika.TikaParser@8c0162
2011-02-07 14:03:34,693 WARN parse.ParseUtil - Unable to successfully parse content http://downtownjoes.com/botb1.flv of type video/x-flv
2011-02-07 14:04:04,725 WARN parse.ParseUtil - TIMEOUT parsing http://downtownjoes.com/dtj.flv with org.apache.nutch.parse.tika.TikaParser@8c0162
2011-02-07 14:04:04,725 WARN parse.ParseUtil - Unable to successfully parse content http://downtownjoes.com/dtj.flv of type video/x-flv
2011-02-07 14:04:34,772 WARN parse.ParseUtil - TIMEOUT parsing http://downtownjoes.com/botb2.flv with org.apache.nutch.parse.tika.TikaParser@8c0162
2011-02-07 14:04:34,772 WARN parse.ParseUtil - Unable to successfully parse content http://downtownjoes.com/botb2.flv of type video/x-flv
{noformat}

After:
{noformat}
2011-02-08 09:06:54,482 INFO parse.ParserJob - http://downtownjoes.com/botb1.flv skipped. Content of size 4527822 was truncated to 63980
2011-02-08 09:06:54,482 INFO parse.ParserJob - http://downtownjoes.com/dtj.flv skipped. Content of size 2692082 was truncated to 63980
2011-02-08 09:06:54,482 INFO parse.ParserJob - http://downtownjoes.com/botb2.flv skipped. Content of size 35496213 was truncated to 61058
{noformat}

Parsing takes up 100% CPU
Key: NUTCH-965
URL: https://issues.apache.org/jira/browse/NUTCH-965
Project: Nutch
Issue Type: Improvement
Components: parser
Reporter: Alexis
Attachments: parserJob.patch

The issue you're likely to run into when parsing truncated FLV files is described here: http://www.mail-archive.com/user@nutch.apache.org/msg01880.html

The parser library gets stuck in an infinite loop when it encounters corrupted data, for example when big binary files are truncated at fetch time.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
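The check described in the NUTCH-965 comment above can be sketched roughly as follows. This is not the contents of parserJob.patch; the class name TruncationCheckSketch and the method isTruncated are hypothetical. The sketch assumes that content counts as truncated only when the declared Content-Length is larger than the number of bytes actually buffered, and that a missing or malformed header means "unknown", so parsing proceeds.

```java
// Hypothetical helper, for illustration only; not taken from parserJob.patch.
public class TruncationCheckSketch {

    // Returns true when the Content-Length header declares more bytes
    // than were actually buffered at fetch time.
    static boolean isTruncated(String contentLengthHeader, int bufferedBytes) {
        if (contentLengthHeader == null) {
            // Header not available: we cannot tell, so do not skip parsing.
            return false;
        }
        try {
            long declared = Long.parseLong(contentLengthHeader.trim());
            return declared > bufferedBytes;
        } catch (NumberFormatException e) {
            // Malformed header: treat as unknown rather than truncated.
            return false;
        }
    }

    public static void main(String[] args) {
        // Sizes taken from the "After" log lines above.
        if (isTruncated("4527822", 63980)) {
            System.out.println("skipped. Content of size 4527822 was truncated to 63980");
        }
    }
}
```

A caller in the parse step would test this before handing the content to the parser and log a "skipped" message instead of parsing.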
Build failed in Hudson: Nutch-trunk #1393
See https://hudson.apache.org/hudson/job/Nutch-trunk/1393/

--
[...truncated 1008 lines...]
A src/plugin/subcollection/src/java/org/apache/nutch/collection
A src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java
A src/plugin/subcollection/src/java/org/apache/nutch/collection/CollectionManager.java
A src/plugin/subcollection/src/java/org/apache/nutch/collection/package.html
A src/plugin/subcollection/src/java/org/apache/nutch/indexer
A src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection
A src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java
A src/plugin/subcollection/README.txt
A src/plugin/subcollection/plugin.xml
A src/plugin/subcollection/build.xml
A src/plugin/index-more
A src/plugin/index-more/ivy.xml
A src/plugin/index-more/src
A src/plugin/index-more/src/test
A src/plugin/index-more/src/test/org
A src/plugin/index-more/src/test/org/apache
A src/plugin/index-more/src/test/org/apache/nutch
A src/plugin/index-more/src/test/org/apache/nutch/indexer
A src/plugin/index-more/src/test/org/apache/nutch/indexer/more
A src/plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
A src/plugin/index-more/src/java
A src/plugin/index-more/src/java/org
A src/plugin/index-more/src/java/org/apache
A src/plugin/index-more/src/java/org/apache/nutch
A src/plugin/index-more/src/java/org/apache/nutch/indexer
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more/package.html
A src/plugin/index-more/plugin.xml
A src/plugin/index-more/build.xml
AU src/plugin/plugin.dtd
A src/plugin/parse-ext
A src/plugin/parse-ext/ivy.xml
A src/plugin/parse-ext/src
A src/plugin/parse-ext/src/test
A src/plugin/parse-ext/src/test/org
A src/plugin/parse-ext/src/test/org/apache
A src/plugin/parse-ext/src/test/org/apache/nutch
A src/plugin/parse-ext/src/test/org/apache/nutch/parse
A src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext
A src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext/TestExtParser.java
A src/plugin/parse-ext/src/java
A src/plugin/parse-ext/src/java/org
A src/plugin/parse-ext/src/java/org/apache
A src/plugin/parse-ext/src/java/org/apache/nutch
A src/plugin/parse-ext/src/java/org/apache/nutch/parse
A src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext
A src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext/ExtParser.java
A src/plugin/parse-ext/plugin.xml
A src/plugin/parse-ext/build.xml
A src/plugin/parse-ext/command
A src/plugin/urlnormalizer-pass
A src/plugin/urlnormalizer-pass/ivy.xml
A src/plugin/urlnormalizer-pass/src
A src/plugin/urlnormalizer-pass/src/test
A src/plugin/urlnormalizer-pass/src/test/org
A src/plugin/urlnormalizer-pass/src/test/org/apache
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass
AU src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass/TestPassURLNormalizer.java
A src/plugin/urlnormalizer-pass/src/java
A src/plugin/urlnormalizer-pass/src/java/org
A src/plugin/urlnormalizer-pass/src/java/org/apache
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass
AU src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass/PassURLNormalizer.java
AU src/plugin/urlnormalizer-pass/plugin.xml
AU src/plugin/urlnormalizer-pass/build.xml
A src/plugin/parse-html
A src/plugin/parse-html/ivy.xml
A src/plugin/parse-html/lib
A src/plugin/parse-html/lib/tagsoup.LICENSE.txt
A src/plugin/parse-html/src
A src/plugin/parse-html/src/test
A src/plugin/parse-html/src/test/org
A src/plugin/parse-html/src/test/org/apache
A src/plugin/parse-html/src/test/org/apache/nutch
A src/plugin/parse-html/src/test/org/apache/nutch/parse
A