Sami Siren wrote:


I'd like to solicit more comments on the impact of this solution, before going forward. I can apply the other simple whitespace-related change, though...


Why don't we remove the whitespace once, at the parsing stage (for HTML the right place is perhaps DomContentUtils), instead of over and over again when creating fragments, as Doug pointed out earlier? The getText method in DomContentUtils also has another problem: it sometimes incorrectly concatenates strings (words) when they are separated only by HTML tags and no whitespace.

I re-worked and applied the patch according to your suggestions. I also created some additional tests, including tests for whitespace processing. Please check out the latest CVS version and see if it works for you.
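
For anyone following the thread, here is a rough sketch of the idea discussed above: collapse whitespace once while walking the DOM, and put a single space between text that comes from different nodes so that words separated only by tags are not glued together. This is not the code that was committed. The class and method names (WhitespaceSketch, collect, extract) are made up for illustration, and it assumes well-formed XHTML input so that the stock JDK parser can build the DOM.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not the actual DomContentUtils code.
class WhitespaceSketch {

    // Collect the text under a node, normalizing whitespace during the walk.
    static String getText(Node node) {
        StringBuilder sb = new StringBuilder();
        collect(node, sb);
        return sb.toString();
    }

    private static void collect(Node node, StringBuilder sb) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            // Collapse runs of whitespace inside this text node.
            String text = node.getNodeValue().replaceAll("\\s+", " ").trim();
            if (!text.isEmpty()) {
                // Separate text from different nodes with one space, so
                // "Hello<b>world</b>" does not become "Helloworld".
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(text);
            }
            return;
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collect(c, sb);
        }
    }

    // Helper for trying it out: assumes the input is well-formed XHTML.
    static String extract(String xhtml) {
        try {
            Node root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)))
                    .getDocumentElement();
            return getText(root);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extract("<p>Hello<b>world</b>  and\n more</p>"));
    }
}
```

Running main prints "Hello world and more". Note the trade-off in this simple version: it always inserts a space at node boundaries, so markup inside a word (for example "<b>H</b>ello") would be split into "H ello". A more careful version would probably insert the space only around block-level elements.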


--
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)



_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
