Mike Smith wrote:
Then I tried a local crawl using these URLs and put some logging at
RegexURLFilter.java:86. I could see that the regex (-.*(/.+?)/.*?\1/.*?\1/)
takes more than 10 minutes. The problem is that the JavaScript parser
extracts some bogus links like this:
http://www.discountedboots.com/<SELECT%20%20NAME%3D%22EDIT_BROWSE%22>
………
These links are very long and contain many / characters. They are
created from scripts like this:
Yes, JS parser needs to be fixed - it's been on my TODO list for a long
time now, but my todo list is very long nowadays ... if someone else
wants to give it a try I won't object ;)
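
Until the parser is fixed, one possible stopgap is to reject obviously
bogus URLs before the expensive regex ever sees them. The sketch below is
not the actual RegexURLFilter code; the class name UrlPreFilter, the
accept() method and the 512-character limit are all hypothetical, and it
assumes the rule is applied with Matcher.find(). The pattern is the rule
quoted above without its leading '-'.

```java
import java.util.regex.Pattern;

// Hypothetical pre-filter, not part of Nutch: cheap checks run first so
// that the backtracking-heavy loop pattern only sees short, sane URLs.
final class UrlPreFilter {
    // The rule quoted above, minus the leading '-' (exclude marker).
    static final Pattern LOOP = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");

    // Assumed limit; pick whatever suits the crawl.
    static final int MAX_URL_LENGTH = 512;

    static boolean accept(String url) {
        // Very long URLs are where the regex becomes slow, so drop them early.
        if (url.length() > MAX_URL_LENGTH) {
            return false;
        }
        // Raw angle brackets suggest markup picked up by the JS parser.
        if (url.indexOf('<') >= 0 || url.indexOf('>') >= 0) {
            return false;
        }
        // Reject URLs in which a path segment repeats three times.
        return !LOOP.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(accept("http://example.com/a/b/c.html"));
        System.out.println(accept("http://example.com/a/b/a/b/a/b/"));
        System.out.println(accept(
            "http://www.discountedboots.com/<SELECT%20%20NAME%3D%22EDIT_BROWSE%22"));
    }
}
```

This only hides the symptom: the bogus links are still extracted, they
are just thrown away cheaply instead of after minutes of backtracking.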
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com