Thanks -- I will probably not be able to get to this further until tonight anyhow.
Karl

On Thu, Sep 18, 2014 at 8:16 AM, Erlend Garåsen <e.f.gara...@usit.uio.no> wrote:

>
> I tried to fetch documents by using curl from our prod server just in case
> a webmaster had blocked access. No problem. Maybe I should ask the
> webmaster of that host anyway, just to be sure.
>
> The interrupted message may have been caused by an abort of that job.
>
> I think I should just stop the problematic job and start all the other
> three remaining jobs instead. I bet they will all complete. Ideally we
> shouldn't crawl www.duo.uio.no at all since it's a DSpace resource. I
> have just contacted someone who is indexing DSpace resources. I guess a
> DSpace connector is a better approach.
>
> Below you'll find some parameters.
>
> REPOSITORY CONNECTION
> ---------------------
> Throttling -> max connections: 30
> Throttling -> Max fetches/min: 100
> Bandwidth -> max connections: 25
> Bandwidth -> max kbytes/sec: 8000
> Bandwidth -> max fetches/min: 20
>
> JOB SETTINGS
> ------------
>
> Hop filters: Keep forever
>
> Seeds: https://www.duo.uio.no/
>
> Exclude from crawl:
> # Exclude some file types:
> \.gif$
> \.GIF$
> \.jpeg$
> \.JPEG$
> \.jpg$
> \.JPG$
> \.png$
> \.PNG$
> \.mpg$
> \.MPG$
> \.mpeg$
> \.MPEG$
> \.exe$
> \.bmp$
> \.BMP$
> \.mov$
> \.MOV$
> \.wmf$
> \.css$
> \.ico$
> \.ICO$
> \.mp2$
> \.mp3$
> \.mp4$
> \.wmv$
> \.tif$
> \.tiff$
> \.avi$
> \.ogg$
> \.ogv$
> \.zip$
> \.gz$
> \.psd$
>
> # TIKA-1011
> \.mhtml$
>
> # Exclude log files:
> \.log$
> \.logfile$
>
> # In general, indexing of DUO search results is not allowed:
> https?://www\.duo\.uio\.no/sok/search.*
>
> # Other elements in DUO to be excluded:
> https://www\.duo\.uio\.no.*open-search/description\.xml$
> https://www\.duo\.uio\.no/(inn|login|feed|search|advanced-search|community-list|browse|password-login|inn|discover).*
>
> # Skip locale settings - makes duplicates:
> https://www\.duo\.uio\.no/.*\?locale-attribute=\w{2}$
>
> # Temporarily skip PDFs since we are indexing abstracts:
> https://www\.duo\.uio\.no/bitstream/handle/.+
>
> # skip full item record:
> https://www\.duo\.uio\.no/handle/\d{9}/\d+\?show=full$
> # new URL structure:
> https://www\.duo\.uio\.no/handle/.*\?show=full$
>
> # Skip all navigations but "start with letter":
> https://www\.duo\.uio\.no/.*type=(author|dateissued)$
>
> # Skip search:
> #https://www\.duo\.uio\.no/handle/.*/discover\?.*
> https://www\.duo\.uio\.no/handle/.*search-filter\?.*
> # new URL structure:
> https://www\.duo\.uio\.no/discover\?.*
> https://www\.duo\.uio\.no/search-filter\?.*
>
> # Skip statistics:
> https://www\.duo\.uio\.no/handle/.*/statistics$
>
> Exclude from index:
> # Exclude front page - no valuable info and we have QL:
> https?://www\.duo\.uio\.no/$
>
> # Do not index navigation, but follow:
> https://www\.duo\.uio\.no/handle/\d{9}/\d+/.+
> # new URL structure:
> https://www\.duo\.uio\.no/handle/\d+/\d+/.+
>
> # Exclude id's lower than four, probably category listing:
> https://www\.duo\.uio\.no/handle/\d{9}/\d{1,4}$
> # new URL structure:
> https://www\.duo\.uio\.no/handle/\d+/\d{1,3}$
>
> Thanks for looking at this!
>
> BTW: Within an hour, I will be away from my computer and cannot test
> anymore until Monday. I'm leaving Oslo for some days, but I will still be
> able to read and answer emails.
>
> Erlend
>
>
> On 18.09.14 13:43, Karl Wright wrote:
>
>> Hi Erlend,
>>
>> The "Interrupted: null" message with a -104 code means only that the fetch
>> was interrupted by something. Unfortunately, the message is not clear
>> about what the cause of the interruption is. This is unrelated to
>> Zookeeper; but I agree that it is suspicious that many such interruptions
>> appear right after robots is parsed.
>>
>> One cause of a -104 is when the target server forcibly drops the
>> connection, so an InterruptedIOException is thrown. Having a look at the
>> timestamps for the fetch messages, it looks believable that you might have
>> exceeded some predetermined limit on that machine. They're all within a
>> few milliseconds of each other. When a robots file needs to be read,
>> ManifoldCF creates an event for that, and the urls blocked by that event
>> will all be 'fetchable' as soon as the event is released. Perhaps your
>> throttling needs to be adjusted now that the rate limit bug has been
>> fixed?
>>
>> I won't be able to work with this without at least your crawling
>> parameters for the server in question. I can ping that server so if you
>> would like I can try crawling that server from here.
>>
>> For zookeeper, I would still try to either increase your tick count to
>> maybe 10000, or better yet, find out why you periodically lose the ability
>> to transmit pings from MCF to your zookeeper process.
>>
>> Thanks,
>> Karl
>>
>>
>> On Thu, Sep 18, 2014 at 7:15 AM, Erlend Garåsen <e.f.gara...@usit.uio.no>
>> wrote:
>>
>>> On 18.09.14 13:00, Karl Wright wrote:
>>>
>>>> Hi Erlend,
>>>>
>>>> please can you also add the manifoldcf log as well?
>>>>
>>> Yes, I will, but it includes entries from RC0 as well.
>>>
>>> MCF works perfectly using the other jobs for the other hosts. Take a look
>>> at the following once again. MCF is being interrupted:
>>> INFO 2014-09-18 11:13:42,824 (Worker thread '19') - WEB: FETCH URL|
>>> https://www.duo.uio.no/|1411030940209+682605|-104|
>>> 4096|org.apache.manifoldcf.core.interfaces.ManifoldCFException|
>>> Interrupted: Interrupted: null
>>>
>>> You can find this entry near the other regarding the robots.txt file:
>>> http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log
>>>
>>> Erlend
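
For anyone who wants to sanity-check the "Exclude from crawl" patterns quoted above without running a job, here is a minimal Python sketch. It assumes each pattern is applied as an unanchored search against the full URL; that is my assumption about how the web connector evaluates them, not something confirmed in this thread. The helper name is_excluded and the sample URLs are hypothetical, and only a handful of the quoted patterns are included.

```python
import re

# A subset of the "Exclude from crawl" patterns quoted in this thread.
EXCLUDE_PATTERNS = [
    r"\.gif$",
    r"https?://www\.duo\.uio\.no/sok/search.*",
    r"https://www\.duo\.uio\.no/.*\?locale-attribute=\w{2}$",
    r"https://www\.duo\.uio\.no/bitstream/handle/.+",
    r"https://www\.duo\.uio\.no/handle/.*\?show=full$",
    r"https://www\.duo\.uio\.no/discover\?.*",
    r"https://www\.duo\.uio\.no/handle/.*/statistics$",
]

COMPILED = [re.compile(p) for p in EXCLUDE_PATTERNS]


def is_excluded(url: str) -> bool:
    """Return True if any pattern matches somewhere in the URL.

    Assumption: patterns are used as unanchored searches (re.search),
    so a pattern only needs to match part of the URL unless it
    carries its own ^ or $ anchors.
    """
    return any(p.search(url) for p in COMPILED)


if __name__ == "__main__":
    # Hypothetical sample URLs, not taken from the crawl log.
    samples = [
        "https://www.duo.uio.no/",
        "https://www.duo.uio.no/handle/10852/123",
        "https://www.duo.uio.no/handle/10852/123?show=full",
        "https://www.duo.uio.no/bitstream/handle/10852/123/thesis.pdf",
        "https://www.duo.uio.no/?locale-attribute=en",
    ]
    for url in samples:
        print(("EXCLUDED " if is_excluded(url) else "kept     ") + url)
```

With these patterns the seed URL and the plain handle URL are kept, while the show=full, bitstream and locale-attribute URLs are excluded. If the seed itself ever shows up as excluded in a check like this, the job would have nothing to crawl, which is worth ruling out before looking at throttling.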