parsing a simple text node

2011-02-08 Thread Jun Yang
Hi there,

I am working on a plugin to fetch structured information (e.g., product
price) from web pages, and I have run into a problem parsing the following
simple node:

<span class=product-price-amount>
 $27.00</span>

The parser first returned the Node for the span, which has a single child, a
text Node. I would expect this text Node to have the value $27.00, but when I
called getNodeValue() the return value was empty. I also cast this child node
to Text and called getWholeText(), but still got an empty return value.

Does anyone know what's going on? The text $27.00 seems to be missing from
the whole hierarchy.

Jun


Re: parsing a simple text node

2011-02-08 Thread Evert Wagenaar
Hi Jun, 

Could it be that the price is set by JavaScript at the moment the page is
displayed in your browser? In that case the price actually lives in some data
source (XML) or a separate .js file. This is sometimes done when pages need to
be displayed in several browsers, such as the iPhone's and regular browsers.

Did you try using an XPath expression? In your case it would be
//span[@class='product-price-amount']. There are some good Firefox add-ons for
testing XPaths on HTML. I use XPather.
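
For what it's worth, here is a rough sketch of evaluating that XPath with the standard javax.xml.xpath API. It is not tested against the DOM that Nutch's HTML parser produces (element names there may differ in case, depending on parser configuration), and the class and method names are made up for illustration. It reads the text with getTextContent() on the span rather than getNodeValue(), since getNodeValue() on an Element node is null by definition.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch.
public class PriceXPathSketch {

  // Parses a well-formed XML/XHTML string into a DOM Document.
  public static Document parse(String xml) throws Exception {
    return DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(new InputSource(new StringReader(xml)));
  }

  // Returns the trimmed text inside the price span, or null if no span matches.
  public static String extractPrice(Document doc) throws Exception {
    XPath xpath = XPathFactory.newInstance().newXPath();
    Node span = (Node) xpath.evaluate(
        "//span[@class='product-price-amount']", doc, XPathConstants.NODE);
    // getTextContent() concatenates all descendant text nodes.
    return span == null ? null : span.getTextContent().trim();
  }

  public static void main(String[] args) throws Exception {
    Document doc = parse(
        "<html><body><span class=\"product-price-amount\">\n $27.00</span></body></html>");
    System.out.println(extractPrice(doc));
  }
}
```

If this returns null or an empty string on the real page, that would point towards the price being filled in by JavaScript rather than being present in the fetched HTML.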

Regards, 

Evert 




From: Jun Yang jun...@gmail.com
To: dev@nutch.apache.org
Sent: Tuesday, 8 February 2011 09:16:50
Subject: parsing a simple text node


[jira] Updated: (NUTCH-965) Parsing takes up 100% CPU

2011-02-08 Thread Alexis (JIRA)

 [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexis updated NUTCH-965:
-

Attachment: parserJob.patch

In the parser mapper, compare the Content-Length header to the size of the
content buffer to see whether they match.

If this HTTP header is available and the file was truncated, skip the parsing
step so that the parser does not get stuck in an infinite loop that takes up
all the CPU resources.
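
As a rough illustration of the check described above (this is not the code from parserJob.patch; the class and method names are hypothetical, and treating "declared length greater than stored length" as truncation is an assumption):

```java
// Hypothetical sketch of the truncation check; not taken from the patch.
public class TruncationCheckSketch {

  /**
   * Returns true when the Content-Length header declares more bytes than
   * were actually stored in the content buffer. A missing or unparsable
   * header yields false, because truncation cannot be determined.
   */
  public static boolean isTruncated(String contentLengthHeader, int actualLength) {
    if (contentLengthHeader == null) {
      return false; // header not available: nothing to compare against
    }
    try {
      long declared = Long.parseLong(contentLengthHeader.trim());
      return declared > actualLength;
    } catch (NumberFormatException e) {
      return false; // malformed header: do not skip parsing on a guess
    }
  }

  public static void main(String[] args) {
    // Sizes taken from the log lines in this comment.
    System.out.println(isTruncated("4527822", 63980));
    System.out.println(isTruncated("63980", 63980));
  }
}
```

A caller would skip the parse step and log the URL when isTruncated returns true, and parse as usual otherwise.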


Before, in the logs, we would see:

{noformat}
2011-02-07 14:03:34,693 WARN  parse.ParseUtil - TIMEOUT parsing http://downtownjoes.com/botb1.flv with org.apache.nutch.parse.tika.TikaParser@8c0162
2011-02-07 14:03:34,693 WARN  parse.ParseUtil - Unable to successfully parse content http://downtownjoes.com/botb1.flv of type video/x-flv
2011-02-07 14:04:04,725 WARN  parse.ParseUtil - TIMEOUT parsing http://downtownjoes.com/dtj.flv with org.apache.nutch.parse.tika.TikaParser@8c0162
2011-02-07 14:04:04,725 WARN  parse.ParseUtil - Unable to successfully parse content http://downtownjoes.com/dtj.flv of type video/x-flv
2011-02-07 14:04:34,772 WARN  parse.ParseUtil - TIMEOUT parsing http://downtownjoes.com/botb2.flv with org.apache.nutch.parse.tika.TikaParser@8c0162
2011-02-07 14:04:34,772 WARN  parse.ParseUtil - Unable to successfully parse content http://downtownjoes.com/botb2.flv of type video/x-flv
{noformat}

After:

{noformat}
2011-02-08 09:06:54,482 INFO  parse.ParserJob - http://downtownjoes.com/botb1.flv skipped. Content of size 4527822 was truncated to 63980
2011-02-08 09:06:54,482 INFO  parse.ParserJob - http://downtownjoes.com/dtj.flv skipped. Content of size 2692082 was truncated to 63980
2011-02-08 09:06:54,482 INFO  parse.ParserJob - http://downtownjoes.com/botb2.flv skipped. Content of size 35496213 was truncated to 61058
{noformat}




 Parsing takes up 100% CPU
 -

 Key: NUTCH-965
 URL: https://issues.apache.org/jira/browse/NUTCH-965
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Reporter: Alexis
 Attachments: parserJob.patch


 The issue you're likely to run into when parsing truncated FLV files is 
 described here:
 http://www.mail-archive.com/user@nutch.apache.org/msg01880.html
 The parser library gets stuck in an infinite loop when it encounters corrupted
 data, caused for example by truncating big binary files at fetch time.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




Build failed in Hudson: Nutch-trunk #1393

2011-02-08 Thread Apache Hudson Server
See https://hudson.apache.org/hudson/job/Nutch-trunk/1393/

--
[...truncated 1008 lines...]
A src/plugin/subcollection/src/java/org/apache/nutch/collection
A src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java
A src/plugin/subcollection/src/java/org/apache/nutch/collection/CollectionManager.java
A src/plugin/subcollection/src/java/org/apache/nutch/collection/package.html
A src/plugin/subcollection/src/java/org/apache/nutch/indexer
A src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection
A src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java
A src/plugin/subcollection/README.txt
A src/plugin/subcollection/plugin.xml
A src/plugin/subcollection/build.xml
A src/plugin/index-more
A src/plugin/index-more/ivy.xml
A src/plugin/index-more/src
A src/plugin/index-more/src/test
A src/plugin/index-more/src/test/org
A src/plugin/index-more/src/test/org/apache
A src/plugin/index-more/src/test/org/apache/nutch
A src/plugin/index-more/src/test/org/apache/nutch/indexer
A src/plugin/index-more/src/test/org/apache/nutch/indexer/more
A src/plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
A src/plugin/index-more/src/java
A src/plugin/index-more/src/java/org
A src/plugin/index-more/src/java/org/apache
A src/plugin/index-more/src/java/org/apache/nutch
A src/plugin/index-more/src/java/org/apache/nutch/indexer
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more/package.html
A src/plugin/index-more/plugin.xml
A src/plugin/index-more/build.xml
AU src/plugin/plugin.dtd
A src/plugin/parse-ext
A src/plugin/parse-ext/ivy.xml
A src/plugin/parse-ext/src
A src/plugin/parse-ext/src/test
A src/plugin/parse-ext/src/test/org
A src/plugin/parse-ext/src/test/org/apache
A src/plugin/parse-ext/src/test/org/apache/nutch
A src/plugin/parse-ext/src/test/org/apache/nutch/parse
A src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext
A src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext/TestExtParser.java
A src/plugin/parse-ext/src/java
A src/plugin/parse-ext/src/java/org
A src/plugin/parse-ext/src/java/org/apache
A src/plugin/parse-ext/src/java/org/apache/nutch
A src/plugin/parse-ext/src/java/org/apache/nutch/parse
A src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext
A src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext/ExtParser.java
A src/plugin/parse-ext/plugin.xml
A src/plugin/parse-ext/build.xml
A src/plugin/parse-ext/command
A src/plugin/urlnormalizer-pass
A src/plugin/urlnormalizer-pass/ivy.xml
A src/plugin/urlnormalizer-pass/src
A src/plugin/urlnormalizer-pass/src/test
A src/plugin/urlnormalizer-pass/src/test/org
A src/plugin/urlnormalizer-pass/src/test/org/apache
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass
AU src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass/TestPassURLNormalizer.java
A src/plugin/urlnormalizer-pass/src/java
A src/plugin/urlnormalizer-pass/src/java/org
A src/plugin/urlnormalizer-pass/src/java/org/apache
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass
AU src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass/PassURLNormalizer.java
AU src/plugin/urlnormalizer-pass/plugin.xml
AU src/plugin/urlnormalizer-pass/build.xml
A src/plugin/parse-html
A src/plugin/parse-html/ivy.xml
A src/plugin/parse-html/lib
A src/plugin/parse-html/lib/tagsoup.LICENSE.txt
A src/plugin/parse-html/src
A src/plugin/parse-html/src/test
A src/plugin/parse-html/src/test/org
A src/plugin/parse-html/src/test/org/apache
A src/plugin/parse-html/src/test/org/apache/nutch
A src/plugin/parse-html/src/test/org/apache/nutch/parse
A