cmarschner    2002/06/18 04:39:51

Modified:    contributions/webcrawler-LARM TODO.txt
Log:
see file

Revision  Changes    Path
1.2       +40 -13    jakarta-lucene-sandbox/contributions/webcrawler-LARM/TODO.txt

Index: TODO.txt
===================================================================
RCS file: /home/cvs/jakarta-lucene-sandbox/contributions/webcrawler-LARM/TODO.txt,v
retrieving revision 1.1
retrieving revision 1.2
diff -u -r1.1 -r1.2
--- TODO.txt	1 Jun 2002 18:55:15 -0000	1.1
+++ TODO.txt	18 Jun 2002 11:39:51 -0000	1.2
@@ -1,11 +1,39 @@
 Todos for 1.0 (not yet ordered in decreasing priority)

-$id: $
+$Id$
+
+-----------------------------------------------------------------------------------------------
+solved:
+-----------------------------------------------------------------------------------------------
+
+Bugs:
+  - some relative URLs are not appended appropriately, leading to wrong and growing URLs
+  - 301/302 URLs were not updated: the docs were saved under the old URL, which lead to
+    wrong relative URLs (cmarschner, 2002-06-17)
+
+URLs:
+  - include a URLNormalizer
+    * lowercase host names
+    * avoid ambiguities like '%20' / '+'
+    * make sure http://host URLs end with "/"
+    * avoid host name aliases
+      - two host names / one ip adress can point to the same web site: www.lmu.de / www.uni-muenchen.de
+      - two host names / one ip adress can point to different web sites (then other URLs / pages must differ)
+        suche.lmu.de / interesse.lmu.de
+    * cater 301/302 result codes
+    STATUS: seems to be solved except that URL parameters can occur in different orders, which is NOT resolved
+            host names are resolved by hand, via a synonym in HostManager. (cmarschner, 2002-06-17)
+            problem: URLMessage size doubles
+
+-----------------------------------------------------------------------------------------------
+remaining:
+-----------------------------------------------------------------------------------------------

 * Bugs
   - on very fast LAN connections (100MBit), sockets are not freed as fast as allocated
-  - some relative URLs are not appended appropriately, leading to wrong and growing URLs
+    probably this will be solved by changing from HTTPClient.* to Jakarta HTTP client and reuse sockets
+

 * Build
   - added build.xml, but build.bat and build.sh are still working without ANT. Change that.
@@ -16,16 +44,6 @@
 * Configuration
   - move all configuration stuff into a meaningful properties file

-* URLs:
-  - include a URLNormalizer
-    * lowercase host names
-    * avoid ambiguities like '%20' / '+'
-    * make sure http://host URLs end with "/"
-    * avoid host name aliases
-      - two host names / one ip adress can point to the same web site: www.lmu.de / www.uni-muenchen.de
-      - two host names / one ip adress can point to different web sites (then other URLs / pages must differ)
-        suche.lmu.de / interesse.lmu.de
-    * cater 301/302 result codes

 * Repository
   - optionally use a database as repository (caches, queues, logs)
@@ -50,13 +68,22 @@
 * Politeness
   - add the option to restrict the number of host accesses per hour/minute

+* URL Extraction
+  - URLs can be encoded in different encoding styles - see http://www.unicode.org/unicode/faq/unicode_web.html
+
+* I18N, HTML encoding
+  - determine document encoding style in content-type, meta tag (http-equiv), or Doctype-tag; adapt URLs to
+    encoding style
+
 * Anchor text extraction
   * read until a meaningful end tag, not just the first encountered
   * remove entities
   * optionally remove Tags, leave ALT attribute
   * remove redundant spaces
-
+* URLNormalizer
+  * add possibility to add synonyms to top level domains, i.e. "d1.com = d2.com" --> "sub1.d1.com = sub1.d2.com"
+  * add possibility to detect synonyms automatically, i.e. by comparing IP addresses or file checksums

 Nice-to-have:
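
Editor's note: the URLNormalizer items listed in the diff (lowercase host names, the '%20' / '+' ambiguity, a trailing "/" on bare host URLs, hand-maintained host synonyms) could be sketched roughly as follows. This is not LARM's actual code: the class name `UrlNormalizerSketch`, its methods, and the choice to rewrite '+' to '%20' only in the query string are assumptions made for illustration. Reordering of URL parameters, which the TODO marks as unresolved, is deliberately left out, and fragments and user info are dropped.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; names and behaviour are assumptions, not LARM's API.
class UrlNormalizerSketch {

    // Hand-maintained alias -> canonical host table (the TODO mentions
    // a synonym kept in HostManager; this map only imitates that idea).
    private final Map<String, String> hostSynonyms = new HashMap<>();

    void addHostSynonym(String alias, String canonical) {
        hostSynonyms.put(alias.toLowerCase(Locale.ROOT),
                         canonical.toLowerCase(Locale.ROOT));
    }

    String normalize(String url) {
        URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            return url; // leave unparseable input untouched
        }
        String scheme = uri.getScheme();
        String host = uri.getHost();
        if (scheme == null || host == null) {
            return url; // relative or opaque URI: nothing to normalize here
        }

        // Lowercase the host name, then map aliases to one canonical host.
        host = host.toLowerCase(Locale.ROOT);
        host = hostSynonyms.getOrDefault(host, host);

        // Make sure "http://host" ends with "/".
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }

        // Assumption: treat '+' in the query as an encoded space and
        // rewrite it to '%20' so both spellings compare equal.
        String query = uri.getRawQuery();
        if (query != null) {
            query = query.replace("+", "%20");
        }

        StringBuilder sb = new StringBuilder();
        sb.append(scheme.toLowerCase(Locale.ROOT)).append("://").append(host);
        if (uri.getPort() != -1) {
            sb.append(':').append(uri.getPort());
        }
        sb.append(path);
        if (query != null) {
            sb.append('?').append(query);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        UrlNormalizerSketch n = new UrlNormalizerSketch();
        n.addHostSynonym("www.uni-muenchen.de", "www.lmu.de");
        // prints http://www.lmu.de/
        System.out.println(n.normalize("http://WWW.LMU.DE"));
        // prints http://www.lmu.de/index.html?q=a%20b
        System.out.println(n.normalize("http://www.uni-muenchen.de/index.html?q=a+b"));
    }
}
```

A table of explicit synonyms cannot tell apart two host names that share an IP address but serve different sites (the suche.lmu.de / interesse.lmu.de case in the TODO), which is one reason the sketch does not resolve hosts via DNS.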
