Update of /cvsroot/nutch/nutch/src/plugin/parse-html/src/java/net/nutch/parse/html
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv32410

Modified Files:
        HtmlParser.java 
Log Message:
Do less error checking, in particular on attribute names.  This
enables us to parse a lot more documents.


Index: HtmlParser.java
===================================================================
RCS file: /cvsroot/nutch/nutch/src/plugin/parse-html/src/java/net/nutch/parse/html/HtmlParser.java,v
retrieving revision 1.8
retrieving revision 1.9
diff -C2 -d -r1.8 -r1.9
*** HtmlParser.java     23 Aug 2004 19:16:27 -0000      1.8
--- HtmlParser.java     15 Oct 2004 18:49:39 -0000      1.9
***************
*** 112,116 ****
      
      // parse the content
!     DocumentFragment root = new HTMLDocumentImpl().createDocumentFragment();
      try {
        byte[] contentInOctets = content.getContent();
--- 112,118 ----
      
      // parse the content
!     HTMLDocumentImpl impl = new HTMLDocumentImpl();
!     impl.setErrorChecking(false);
!     DocumentFragment root = impl.createDocumentFragment();
      try {
        byte[] contentInOctets = content.getContent();
***************
*** 156,159 ****
--- 158,163 ----
      } catch (IOException e) {
        throw new ParseException(e);
+     } catch (DOMException e) {
+       throw new ParseException(e);
      } catch (SAXException e) {
        throw new ParseException(e);
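
The effect of turning off error checking can be sketched outside Nutch. This is a minimal illustration, not the code from HtmlParser.java. It assumes the JDK's built-in DOM implementation and uses the standard DOM Level 3 call `Document.setStrictErrorChecking(false)` as a stand-in for the `setErrorChecking(false)` call in the diff above, since `HTMLDocumentImpl` is not part of the public JDK API. The class name `LenientDomDemo`, the helper `acceptsAttributeName`, and the sample attribute name are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class; not part of Nutch.
final class LenientDomDemo {

  /**
   * Returns true if an attribute with the given name can be set on a
   * freshly created element, false if the DOM rejects it.
   */
  static boolean acceptsAttributeName(boolean strict, String name)
      throws ParserConfigurationException {
    Document doc =
        DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    // Assumption: stands in for Xerces' setErrorChecking(boolean).
    doc.setStrictErrorChecking(strict);
    Element anchor = doc.createElement("a");
    try {
      anchor.setAttribute(name, "value");
      return true;
    } catch (DOMException e) {
      // With checking on, a name that is not a legal XML name is rejected.
      return false;
    }
  }

  public static void main(String[] args) throws Exception {
    // A stray quote glued to an attribute name, as sloppy HTML may produce.
    String sloppy = "href\"";
    System.out.println("strict:  " + acceptsAttributeName(true, sloppy));
    System.out.println("lenient: " + acceptsAttributeName(false, sloppy));
  }
}
```

A `DOMException` is unchecked, so a caller that wants to report such failures uniformly still has to catch it explicitly, which is what the second hunk does by wrapping it in a ParseException.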



_______________________________________________
Nutch-cvs mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-cvs
