Update of /cvsroot/nutch/nutch/src/plugin/parse-html/src/java/net/nutch/parse/html
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv32410

Modified Files:
	HtmlParser.java
Log Message:
Do less error checking, in particular on attribute names. This enables us to parse a lot more documents.

Index: HtmlParser.java
===================================================================
RCS file: /cvsroot/nutch/nutch/src/plugin/parse-html/src/java/net/nutch/parse/html/HtmlParser.java,v
retrieving revision 1.8
retrieving revision 1.9
diff -C2 -d -r1.8 -r1.9
*** HtmlParser.java	23 Aug 2004 19:16:27 -0000	1.8
--- HtmlParser.java	15 Oct 2004 18:49:39 -0000	1.9
***************
*** 112,116 ****
  
      // parse the content
!     DocumentFragment root = new HTMLDocumentImpl().createDocumentFragment();
      try {
        byte[] contentInOctets = content.getContent();
--- 112,118 ----
  
      // parse the content
!     HTMLDocumentImpl impl = new HTMLDocumentImpl();
!     impl.setErrorChecking(false);
!     DocumentFragment root = impl.createDocumentFragment();
      try {
        byte[] contentInOctets = content.getContent();
***************
*** 156,159 ****
--- 158,163 ----
      } catch (IOException e) {
        throw new ParseException(e);
+     } catch (DOMException e) {
+       throw new ParseException(e);
      } catch (SAXException e) {
        throw new ParseException(e);

_______________________________________________
Nutch-cvs mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-cvs
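
For readers wondering what "less error checking on attribute names" buys us, here is a minimal sketch. It is not the HtmlParser code: it assumes the JDK's bundled DOM and uses the standard DOM Level 3 call Document.setStrictErrorChecking(false) as a stand-in for HTMLDocumentImpl.setErrorChecking(false) from the commit. The class name AttrNameCheckDemo and the helper acceptsAttributeName are made up for illustration, and how much checking is actually skipped in lenient mode depends on the implementation.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class, not part of Nutch.
public class AttrNameCheckDemo {

  /**
   * Returns true if the DOM accepts the attribute name,
   * false if setting it raises a DOMException.
   */
  static boolean acceptsAttributeName(String name, boolean strict)
      throws ParserConfigurationException {
    Document doc =
        DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    // Assumption: stands in for HTMLDocumentImpl.setErrorChecking(...) in the commit.
    doc.setStrictErrorChecking(strict);
    Element anchor = doc.createElement("a");
    try {
      anchor.setAttribute(name, "x");
      return true;
    } catch (DOMException e) {
      // With checking on, a malformed name is rejected here
      // (typically INVALID_CHARACTER_ERR).
      return false;
    }
  }

  public static void main(String[] args) throws Exception {
    // A stray quote inside an attribute name, as sloppy real-world HTML can produce.
    String sloppy = "on\"click";
    System.out.println("strict:  " + acceptsAttributeName(sloppy, true));
    System.out.println("lenient: " + acceptsAttributeName(sloppy, false));
  }
}
```

With checking on, one malformed attribute name is enough to abort building the tree; with it off, the node is created anyway and the document still gets parsed. The new catch for DOMException in the diff covers whatever DOM errors can still surface, turning them into a ParseException.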