Hi, this is my crawl log; maybe with this somebody can help me:
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
050923 123817 No NutchFileSystem indicated, so defaulting to local fs.
050923 123817 loading file:/usr/nutch-0.6/conf/nutch-default.xml
050923 123818 loading file:/usr/nutch-0.6/conf/crawl-tool.xml
050923 123818 loading file:/usr/nutch-0.6/conf/nutch-site.xml
050923 123818 crawl started in: indici
050923 123818 rootUrlFile = url
050923 123818 threads = 8
050923 123818 depth = 10
050923 123818 Created webdb at LocalFS,/usr/ricerca/indici/db
050923 123818 Starting URL processing
050923 123818 Using URL filter: net.nutch.net.RegexURLFilter
050923 123818 found resource regex-urlfilter.txt at file:/usr/nutch-0.6/conf/regex-urlfilter.txt
050923 123818 Using URL normalizer: net.nutch.net.BasicUrlNormalizer
050923 123818 Added 1 pages
050923 123818 Processing pagesByURL: Sorted 1 instructions in 0.0030 seconds.
050923 123818 Processing pagesByURL: Sorted 333.3333333333333 instructions/second
050923 123818 Processing pagesByURL: Merged to new DB containing 1 records in 0.0010 seconds
050923 123818 Processing pagesByURL: Merged 1000.0 records/second
050923 123818 Processing pagesByMD5: Sorted 1 instructions in 0.0020 seconds.
050923 123818 Processing pagesByMD5: Sorted 500.0 instructions/second
050923 123818 Processing pagesByMD5: Merged to new DB containing 1 records in 0.0 seconds
050923 123818 Processing pagesByMD5: Merged Infinity records/second
050923 123818 Processing linksByMD5: Copied file (4096 bytes) in 0.0030 secs.
050923 123818 Processing linksByURL: Copied file (4096 bytes) in 0.0010 secs.
050923 123818 FetchListTool started
050923 123818 Processing pagesByURL: Sorted 1 instructions in 0.0020 seconds.
050923 123818 Processing pagesByURL: Sorted 500.0 instructions/second
050923 123818 Processing pagesByURL: Merged to new DB containing 1 records in 0.0010 seconds
050923 123818 Processing pagesByURL: Merged 1000.0 records/second
050923 123818 Processing pagesByMD5: Sorted 1 instructions in 0.022 seconds.
050923 123818 Processing pagesByMD5: Sorted 45.45454545454546 instructions/second
050923 123818 Processing pagesByMD5: Merged to new DB containing 1 records in 0.0 seconds
050923 123818 Processing pagesByMD5: Merged Infinity records/second
050923 123818 Processing linksByMD5: Copied file (4096 bytes) in 0.0010 secs.
050923 123818 Processing linksByURL: Copied file (4096 bytes) in 0.0 secs.
050923 123818 Processing /usr/ricerca/indici/segments/20050923123818/fetchlist.unsorted: Sorted 1 entries in 0.0010 seconds.
050923 123818 Processing /usr/ricerca/indici/segments/20050923123818/fetchlist.unsorted: Sorted 1000.0 entries/second
050923 123818 Overall processing: Sorted 1 entries in 0.0010 seconds.
050923 123818 Overall processing: Sorted 0.0010 entries/second
050923 123818 FetchListTool completed
050923 123818 Plugins: looking in: /usr/nutch-0.6/build/plugins
050923 123818 parsing: /usr/nutch-0.6/build/plugins/parse-pdf/plugin.xml
050923 123818 impl: point=net.nutch.parse.Parser class=net.nutch.parse.pdf.PdfParser
050923 123818 parsing: /usr/nutch-0.6/build/plugins/index-basic/plugin.xml
050923 123818 impl: point=net.nutch.indexer.IndexingFilter class=net.nutch.indexer.basic.BasicIndexingFilter
050923 123818 not including: /usr/nutch-0.6/build/plugins/clustering-carrot2
050923 123818 not including: /usr/nutch-0.6/build/plugins/query-site
050923 123818 not including: /usr/nutch-0.6/build/plugins/parse-ext
050923 123818 parsing: /usr/nutch-0.6/build/plugins/index-more/plugin.xml
050923 123818 impl: point=net.nutch.indexer.IndexingFilter class=net.nutch.indexer.more.MoreIndexingFilter
050923 123818 parsing: /usr/nutch-0.6/build/plugins/parse-html/plugin.xml
050923 123818 impl: point=net.nutch.parse.Parser class=net.nutch.parse.html.HtmlParser
050923 123818 not including: /usr/nutch-0.6/build/plugins/query-basic
050923 123818 not including: /usr/nutch-0.6/build/plugins/ontology
050923 123818 parsing: /usr/nutch-0.6/build/plugins/parse-text/plugin.xml
050923 123818 impl: point=net.nutch.parse.Parser class=net.nutch.parse.text.TextParser
050923 123818 not including: /usr/nutch-0.6/build/plugins/protocol-https
050923 123818 parsing: /usr/nutch-0.6/build/plugins/parse-rtf/plugin.xml
050923 123818 impl: point=net.nutch.parse.Parser class=net.nutch.parse.rtf.RTFParseFactory
050923 123818 parsing: /usr/nutch-0.6/build/plugins/parse-msword/plugin.xml
050923 123818 impl: point=net.nutch.parse.Parser class=net.nutch.parse.msword.MSWordParser
050923 123818 not including: /usr/nutch-0.6/build/plugins/parse-mp3
050923 123818 parsing: /usr/nutch-0.6/build/plugins/protocol-file/plugin.xml
050923 123818 impl: point=net.nutch.protocol.Protocol class=net.nutch.protocol.file.File
050923 123818 not including: /usr/nutch-0.6/build/plugins/creativecommons
050923 123818 parsing: /usr/nutch-0.6/build/plugins/protocol-ftp/plugin.xml
050923 123818 impl: point=net.nutch.protocol.Protocol class=net.nutch.protocol.ftp.Ftp
050923 123818 not including: /usr/nutch-0.6/build/plugins/query-url
050923 123818 not including: /usr/nutch-0.6/build/plugins/language-identifier
050923 123818 parsing: /usr/nutch-0.6/build/plugins/protocol-http/plugin.xml
050923 123818 impl: point=net.nutch.protocol.Protocol class=net.nutch.protocol.http.Http
050923 123818 logging at FINE
050923 123818 logging at INFO
050923 123818 fetching http://aaa.bbb.ccc/dir/servlet/ServletSession?Password=xxx&Login=yyy
050923 123818 http.proxy.host = null
050923 123818 http.proxy.port = 8080
050923 123818 http.timeout = 30000
050923 123818 http.content.limit = -1
050923 123818 http.agent = NutchCVS/0.06-dev (Nutch; http://www.nutch.org/docs/en/bot.html; [EMAIL PROTECTED])
050923 123818 fetcher.server.delay = 5000
050923 123818 http.max.delays = 3
050923 123818 fetching http://aaa.bbb.ccc/robots.txt
050923 123818 fetched 329 bytes from http://aaa.bbb.ccc/robots.txt
050923 123818 fetching http://aaa.bbb.ccc/dir/servlet/ServletSession?Password=xxx&Login=yyy
050923 123828 fetched 0 bytes from http://aaa.bbb.ccc/dir/servlet/ServletSession?Password=xxx&Login=yyy
050923 123828 redirect to http://aaa.bbb.ccc/dir/servlet/ServletAllFrame;jsessionid=7DB76DA9D2315C2D3B1F2775CDF99524.tomcat1
050923 123833 fetching http://aaa.bbb.ccc/dir/servlet/ServletAllFrame;jsessionid=7DB76DA9D2315C2D3B1F2775CDF99524.tomcat1
050923 123833 fetched 254 bytes from http://aaa.bbb.ccc/dir/servlet/ServletAllFrame;jsessionid=7DB76DA9D2315C2D3B1F2775CDF99524.tomcat1
050923 123833 status: segment 20050923123818, 1 pages, 0 errors, 254 bytes, 15033 ms
050923 123833 status: 0.06652032 pages/s, 0.13200127 kb/s, 254.0 bytes/page
050923 123834 Updating /usr/ricerca/indici/db
050923 123834 Updating for /usr/ricerca/indici/segments/20050923123818
050923 123834 Processing document 0
050923 123834 Finishing update
050923 123834 Processing pagesByURL: Sorted 3 instructions in 0.0020 seconds.
050923 123834 Processing pagesByURL: Sorted 1500.0 instructions/second
050923 123834 Processing pagesByURL: Merged to new DB containing 3 records in 0.0 seconds
050923 123834 Processing pagesByURL: Merged Infinity records/second
050923 123834 Processing pagesByMD5: Sorted 4 instructions in 0.0020 seconds.
050923 123834 Processing pagesByMD5: Sorted 2000.0 instructions/second
050923 123834 Processing pagesByMD5: Merged to new DB containing 3 records in 0.0 seconds
050923 123834 Processing pagesByMD5: Merged Infinity records/second
050923 123834 Processing linksByMD5: Sorted 3 instructions in 0.0 seconds.
050923 123834 Processing linksByMD5: Sorted Infinity instructions/second
050923 123834 Processing linksByMD5: Merged to new DB containing 2 records in 0.0010 seconds
050923 123834 Processing linksByMD5: Merged 2000.0 records/second
050923 123834 Processing linksByURL: Sorted 2 instructions in 0.0030 seconds.
050923 123834 Processing linksByURL: Sorted 666.6666666666666 instructions/second
050923 123834 Processing linksByURL: Merged to new DB containing 2 records in 0.0010 seconds
050923 123834 Processing linksByURL: Merged 2000.0 records/second
050923 123834 Processing linksByMD5: Sorted 2 instructions in 0.0010 seconds.
050923 123834 Processing linksByMD5: Sorted 2000.0 instructions/second
050923 123834 Processing linksByMD5: Merged to new DB containing 2 records in 0.0010 seconds
050923 123834 Processing linksByMD5: Merged 2000.0 records/second
050923 123834 Update finished
050923 123834 FetchListTool started
050923 123834 Processing pagesByURL: Sorted 2 instructions in 0.0020 seconds.
050923 123834 Processing pagesByURL: Sorted 1000.0 instructions/second
050923 123834 Processing pagesByURL: Merged to new DB containing 3 records in 0.0 seconds
050923 123834 Processing pagesByURL: Merged Infinity records/second
050923 123834 Processing pagesByMD5: Sorted 2 instructions in 0.0010 seconds.
050923 123834 Processing pagesByMD5: Sorted 2000.0 instructions/second
050923 123834 Processing pagesByMD5: Merged to new DB containing 3 records in 0.0 seconds
050923 123834 Processing pagesByMD5: Merged Infinity records/second
050923 123834 Processing linksByMD5: Copied file (4096 bytes) in 0.0010 secs.
050923 123834 Processing linksByURL: Copied file (4096 bytes) in 0.0010 secs.
050923 123834 Processing /usr/ricerca/indici/segments/20050923123834/fetchlist.unsorted: Sorted 2 entries in 0.0 seconds.
050923 123834 Processing /usr/ricerca/indici/segments/20050923123834/fetchlist.unsorted: Sorted Infinity entries/second
050923 123834 Overall processing: Sorted 2 entries in 0.0 seconds.
050923 123834 Overall processing: Sorted 0.0 entries/second
050923 123834 FetchListTool completed
050923 123834 logging at INFO
050923 123834 fetching http://aaa.bbb.ccc/dir/benvenuto.html
050923 123834 fetching http://aaa.bbb.ccc/dir/servlet/ServletMenu?menu=1
050923 123838 fetching http://aaa.bbb.ccc/dir/servlet/ServletMenu?menu=1
050923 123847 fetched 0 bytes from http://aaa.bbb.ccc/dir/servlet/ServletMenu?menu=1
050923 123847 redirect to http://aaa.bbb.ccc/dir/servlet/../index.html
050923 123848 fetch of http://aaa.bbb.ccc/dir/benvenuto.html failed with: net.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
050923 123852 fetching http://aaa.bbb.ccc/dir/servlet/../index.html
050923 123852 fetched 1560 bytes from http://aaa.bbb.ccc/dir/servlet/../index.html
050923 123852 status: segment 20050923123834, 1 pages, 1 errors, 1560 bytes, 18037 ms
050923 123852 status: 0.055441592 pages/s, 0.6756944 kb/s, 1560.0 bytes/page
050923 123853 Updating /usr/ricerca/indici/db
050923 123853 Updating for /usr/ricerca/indici/segments/20050923123834
050923 123853 Processing document 0
050923 123853 Finishing update
050923 123853 Processing pagesByURL: Sorted 2 instructions in 0.0 seconds.
050923 123853 Processing pagesByURL: Sorted Infinity instructions/second
050923 123853 Processing pagesByURL: Merged to new DB containing 3 records in 0.0010 seconds
050923 123853 Processing pagesByURL: Merged 3000.0 records/second
050923 123853 Processing pagesByMD5: Sorted 3 instructions in 0.0010 seconds.
050923 123853 Processing pagesByMD5: Sorted 3000.0 instructions/second
050923 123853 Processing pagesByMD5: Merged to new DB containing 3 records in 0.0010 seconds
050923 123853 Processing pagesByMD5: Merged 3000.0 records/second
050923 123853 Processing linksByMD5: Sorted 1 instructions in 0.0020 seconds.
050923 123853 Processing linksByMD5: Sorted 500.0 instructions/second
050923 123853 Processing linksByMD5: Merged to new DB containing 2 records in 0.0010 seconds
050923 123853 Processing linksByMD5: Merged 2000.0 records/second
050923 123853 Processing linksByURL: Copied file (4096 bytes) in 0.0 secs.
050923 123854 Update finished
050923 123854 FetchListTool started
050923 123854 Overall processing: Sorted 0 entries in 0.0 seconds.
050923 123854 Overall processing: Sorted NaN entries/second
050923 123854 FetchListTool completed
050923 123854 logging at INFO
050923 123855 Updating /usr/ricerca/indici/db
050923 123855 Updating for /usr/ricerca/indici/segments/20050923123854
050923 123855 Finishing update
050923 123855 Update finished
050923 123855 FetchListTool started
050923 123855 Overall processing: Sorted 0 entries in 0.0 seconds.
050923 123855 Overall processing: Sorted NaN entries/second
050923 123855 FetchListTool completed
050923 123855 logging at INFO
050923 123856 Updating /usr/ricerca/indici/db
050923 123856 Updating for /usr/ricerca/indici/segments/20050923123855
050923 123856 Finishing update
050923 123856 Update finished
050923 123856 FetchListTool started
050923 123856 Overall processing: Sorted 0 entries in 0.0 seconds.
050923 123856 Overall processing: Sorted NaN entries/second
050923 123856 FetchListTool completed
050923 123856 logging at INFO
050923 123857 Updating /usr/ricerca/indici/db
050923 123857 Updating for /usr/ricerca/indici/segments/20050923123856
050923 123857 Finishing update
050923 123857 Update finished
050923 123857 FetchListTool started
050923 123857 Overall processing: Sorted 0 entries in 0.0 seconds.
050923 123857 Overall processing: Sorted NaN entries/second
050923 123857 FetchListTool completed
050923 123857 logging at INFO
050923 123858 Updating /usr/ricerca/indici/db
050923 123858 Updating for /usr/ricerca/indici/segments/20050923123857
050923 123858 Finishing update
050923 123858 Update finished
050923 123858 FetchListTool started
050923 123858 Overall processing: Sorted 0 entries in 0.0 seconds.
050923 123858 Overall processing: Sorted NaN entries/second
050923 123858 FetchListTool completed
050923 123858 logging at INFO
050923 123859 Updating /usr/ricerca/indici/db
050923 123859 Updating for /usr/ricerca/indici/segments/20050923123858
050923 123859 Finishing update
050923 123859 Update finished
050923 123859 FetchListTool started
050923 123859 Overall processing: Sorted 0 entries in 0.0 seconds.
050923 123859 Overall processing: Sorted NaN entries/second
050923 123859 FetchListTool completed
050923 123859 logging at INFO
050923 123900 Updating /usr/ricerca/indici/db
050923 123900 Updating for /usr/ricerca/indici/segments/20050923123859
050923 123900 Finishing update
050923 123900 Update finished
050923 123900 FetchListTool started
050923 123900 Overall processing: Sorted 0 entries in 0.0 seconds.
050923 123900 Overall processing: Sorted NaN entries/second
050923 123900 FetchListTool completed
050923 123900 logging at INFO
050923 123901 Updating /usr/ricerca/indici/db
050923 123901 Updating for /usr/ricerca/indici/segments/20050923123900
050923 123901 Finishing update
050923 123901 Update finished
050923 123901 FetchListTool started
050923 123901 Overall processing: Sorted 0 entries in 0.0 seconds.
050923 123901 Overall processing: Sorted NaN entries/second
050923 123901 FetchListTool completed
050923 123901 logging at INFO
050923 123902 Updating /usr/ricerca/indici/db
050923 123902 Updating for /usr/ricerca/indici/segments/20050923123901
050923 123902 Finishing update
050923 123902 Update finished
050923 123902 FetchListTool started
050923 123902 Processing pagesByURL: Sorted 3 instructions in 0.0010 seconds.
050923 123902 Processing pagesByURL: Sorted 3000.0 instructions/second
050923 123902 Processing pagesByURL: Merged to new DB containing 3 records in 0.0010 seconds
050923 123902 Processing pagesByURL: Merged 3000.0 records/second
050923 123902 Processing pagesByMD5: Sorted 3 instructions in 0.0020 seconds.
050923 123902 Processing pagesByMD5: Sorted 1500.0 instructions/second
050923 123902 Processing pagesByMD5: Merged to new DB containing 3 records in 0.0 seconds
050923 123902 Processing pagesByMD5: Merged Infinity records/second
050923 123902 Processing linksByMD5: Copied file (4096 bytes) in 0.0 secs.
050923 123902 Processing linksByURL: Copied file (4096 bytes) in 0.0010 secs.
050923 123902 Processing /usr/ricerca/indici/segments/20050923123902/fetchlist.unsorted: Sorted 3 entries in 0.0020 seconds.
050923 123902 Processing /usr/ricerca/indici/segments/20050923123902/fetchlist.unsorted: Sorted 1500.0 entries/second
050923 123902 Overall processing: Sorted 3 entries in 0.0020 seconds.
050923 123902 Overall processing: Sorted 6.666666666666666E-4 entries/second
050923 123902 FetchListTool completed
050923 123902 logging at INFO
050923 123902 fetching http://aaa.bbb.ccc/dir/benvenuto.html
050923 123902 fetching http://aaa.bbb.ccc/dir/servlet/ServletMenu?menu=1
050923 123902 fetching http://aaa.bbb.ccc/dir/servlet/ServletMenu?menu=1
050923 123902 fetching http://aaa.bbb.ccc/dir/servlet/ServletSession?Password=xxx&Login=yyy
050923 123902 fetched 0 bytes from http://aaa.bbb.ccc/dir/servlet/ServletMenu?menu=1
050923 123902 redirect to http://aaa.bbb.ccc/dir/servlet/../index.html
050923 123907 fetching http://aaa.bbb.ccc/dir/servlet/ServletSession?Password=xxx&Login=yyy
050923 123908 fetched 0 bytes from http://aaa.bbb.ccc/dir/servlet/ServletSession?Password=xxx&Login=yyy
050923 123908 redirect to http://aaa.bbb.ccc/dir/servlet/ServletAllFrame;jsessionid=0114057133B7ED0ED272FE434426F149.tomcat1
050923 123912 fetch of http://aaa.bbb.ccc/dir/benvenuto.html failed with: net.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
050923 123913 fetching http://aaa.bbb.ccc/dir/servlet/ServletAllFrame;jsessionid=0114057133B7ED0ED272FE434426F149.tomcat1
050923 123913 fetch of http://aaa.bbb.ccc/dir/servlet/ServletMenu?menu=1 failed with: net.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
050923 123913 fetched 254 bytes from http://aaa.bbb.ccc/dir/servlet/ServletAllFrame;jsessionid=0114057133B7ED0ED272FE434426F149.tomcat1
050923 123913 status: segment 20050923123902, 1 pages, 2 errors, 254 bytes, 11022 ms
050923 123913 status: 0.090727635 pages/s, 0.18003765 kb/s, 254.0 bytes/page
050923 123914 indexing segment: /usr/ricerca/indici/segments/20050923123902
050923 123915 * Opening segment 20050923123902
050923 123915 * Indexing segment 20050923123902
050923 123915 found resource mime.types at file:/usr/nutch-0.6/conf/mime.types
050923 123915 found resource common-terms.utf8 at file:/usr/nutch-0.6/conf/common-terms.utf8
050923 123915 * Optimizing index...
050923 123915 * Moving index to NFS if needed...
050923 123915 DONE indexing segment 20050923123902: total 3 records in 0.217 s (Infinity rec/s).
050923 123915 done indexing
050923 123915 Reading url hashes...
050923 123915 Sorting url hashes...
050923 123915 Deleting url duplicates...
050923 123915 Deleted 0 url duplicates.
050923 123915 Reading content hashes...
050923 123915 Sorting content hashes...
050923 123915 Deleting content duplicates...
050923 123915 Deleted 0 content duplicates.
050923 123915 Duplicate deletion complete locally. Now returning to NFS...
050923 123915 DeleteDuplicates complete
050923 123915 merging segment indexes to: /usr/ricerca/indici/index
050923 123915 merging segments _1 (1 docs) into _0 (1 docs)
050923 123915 done merging
050923 123915 crawl finished: indici
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

Please help me!!!

Adriano Palombo

>You can use the "nutch readdb" command to check if the URLs you are interested
>in were added to the WebDB - if yes, check whether the segments contain
>these URLs. Please review the logs from the fetch to check if there was an
>attempt to fetch these URLs (you might have some problem with
>authentication). Right now the description is too generic for me to help
>with more details.
>Regards
>Piotr
>
>[EMAIL PROTECTED] wrote:
>> Hi, I have a question about the Nutch crawler:
>>
>> I want to build a document search over a site that is accessed with
>> authentication (user/password). Once the login is done, the first page
>> displayed by the application is composed of two frames:
>>
>> <HTML>
>> <HEAD>
>> <TITLE>Sistema Provvedimenti - SUPER</TITLE>
>> </HEAD>
>> <FRAMESET ROWS="14%,*">
>> <FRAME NORESIZE NAME="MENU" SRC="Servlet1?menu=1" SCROLLING="AUTO">
>> <FRAME NAME="PAGE" SRC="../a.html" SCROLLING="AUTO">
>> </FRAMESET>
>> </HTML>
>>
>> The servlet "Servlet1" renders a table with 1 row and N columns, where
>> every column contains an href with the URL of another servlet
>> (Servlet2 to ServletN).
>>
>> DESCRIPTION OF THE PROBLEM:
>>
>> My problem is that I see the crawler fetch the login page, the static
>> page a.html, and the servlet Servlet1, but it does not fetch any of the
>> other servlets (Servlet2 to ServletN). If instead I put the hrefs in the
>> page a.html, Nutch manages to fetch the URLs and everything works.
>>
>> DESCRIPTION OF OUR CONFIGURATION OF NUTCH:
>> I installed Nutch 0.6. I launch Nutch this way:
>> /usr/nutch-0.6/bin/nutch crawl url -dir index -depth 10 -threads 8 >& crawl.log
>>
>> where the file "url" contains only the URL of the site, with just the
>> login and password.
>>
>> I modified the Nutch configuration file "crawl-urlfilter.txt" as follows:
>>
>> -^(ftp|mailto):
>> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
>> +[?&=]
>> +.
>>
>> Please, somebody help me!!!
>> It is very important for me
>>
>> Adriano Palombo
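
One way to follow the suggestion of reviewing the fetch log is to summarize which URLs were fetched, which failed, and which redirects occurred. The following is only a minimal Python sketch and is not part of Nutch: it assumes the log has one entry per line and relies only on the entry shapes visible in the log above ("fetched N bytes from URL", "fetch of URL failed with: ...", "redirect to URL"). The function name summarize and the three sample lines are illustrative.

```python
import re
from collections import Counter

# Patterns mirror entry shapes seen in the crawl log; they are assumptions
# about the log format, not an official Nutch log specification.
FETCHED = re.compile(r"fetched (\d+) bytes from (\S+)")
FAILED = re.compile(r"fetch of (\S+) failed with: (.+)")
REDIRECT = re.compile(r"redirect to (\S+)")


def summarize(lines):
    """Return (fetched URL -> bytes, failed URL -> count, redirect targets)."""
    fetched, failed, redirects = {}, Counter(), []
    for line in lines:
        m = FETCHED.search(line)
        if m:
            fetched[m.group(2)] = int(m.group(1))
            continue
        m = FAILED.search(line)
        if m:
            failed[m.group(1)] += 1
            continue
        m = REDIRECT.search(line)
        if m:
            redirects.append(m.group(1))
    return fetched, failed, redirects


# Three entries copied from the log above, used as sample input.
sample = [
    "050923 123847 fetched 0 bytes from http://aaa.bbb.ccc/dir/servlet/ServletMenu?menu=1",
    "050923 123847 redirect to http://aaa.bbb.ccc/dir/servlet/../index.html",
    "050923 123848 fetch of http://aaa.bbb.ccc/dir/benvenuto.html failed with: "
    "net.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.",
]

fetched, failed, redirects = summarize(sample)
print("fetched:", fetched)
print("failed:", dict(failed))
print("redirects:", redirects)
```

Run over the whole crawl.log (for example with open("crawl.log") passed as lines), a summary like this should make it easier to see whether any of the servlet URLs were attempted at all, and which fetches ended in a redirect or a RetryLater failure.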
