Mike Smith wrote:

Then I tried a local crawl using these URLs and added some logging at
RegexURLFilter.java:86. I found that the regex (-.*(/.+?)/.*?\1/.*?\1/)
takes more than 10 minutes. The problem is that the JavaScript parser
extracts some bogus links like this:



http://www.discountedboots.com/<SELECT%20%20NAME%3D%22EDIT_BROWSE%22
………



These links are very long and contain lots of / characters. They are
created from scripts like this:


Yes, the JS parser needs to be fixed. It's been on my TODO list for a long time now, but my TODO list is very long nowadays ... If someone else wants to give it a try, I won't object ;)
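
Until the parser is fixed, one possible stopgap is to reject obviously malformed URLs with a cheap check before they ever reach the backtracking-heavy regex. The following is only a minimal sketch under stated assumptions: the class UrlPreFilter, its method names, and the length and slash limits are hypothetical and not part of Nutch. The pattern is the one quoted above, without the leading '-' rule sign.

```java
import java.util.regex.Pattern;

public class UrlPreFilter {
    // The repeated-path-segment pattern quoted above (leading '-' rule sign omitted).
    static final Pattern REPEATED_SEGMENT =
        Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");

    // Hypothetical limits, chosen only for illustration.
    static final int MAX_URL_LENGTH = 512;
    static final int MAX_SLASHES = 32;

    /** Cheap linear-time check for URLs that are almost certainly parser debris. */
    static boolean looksBogus(String url) {
        if (url.length() > MAX_URL_LENGTH) {
            return true;
        }
        // Raw angle brackets suggest an HTML fragment was mistaken for a link.
        if (url.indexOf('<') >= 0 || url.indexOf('>') >= 0) {
            return true;
        }
        int slashes = 0;
        for (int i = 0; i < url.length(); i++) {
            if (url.charAt(i) == '/') {
                slashes++;
            }
        }
        return slashes > MAX_SLASHES;
    }

    /** Returns true if the URL should be kept, false if it should be dropped. */
    static boolean accept(String url) {
        // Run the cheap guard first so the expensive regex never sees bogus input.
        if (looksBogus(url)) {
            return false;
        }
        return !REPEATED_SEGMENT.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(accept("http://www.example.com/a/b/c.html"));
        System.out.println(accept("http://www.example.com/a/b/a/b/a/b/"));
        System.out.println(accept(
            "http://www.discountedboots.com/<SELECT%20%20NAME%3D%22EDIT_BROWSE%22"));
    }
}
```

The guard runs in time linear in the URL length, so a very long link with many slashes is dropped without invoking the regex at all. The thresholds would need tuning against real crawl data.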

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

