With NUTCH-233 the issue is independent of Hadoop and lies with the
regex-urlfilter. The last solution posted in JIRA gives you more room to work
with: it allowed me to fetch a segment of over 1-2 million, but I ran into the
same issue when the segment approached 10 million in size.
Unless you have another idea for what regular expression we should/can use,
your time would be better spent fixing another issue. I've spent some time
plugging in different strings, but none of them worked.
Currently, I just comment out the whole line in the regex-urlfilter.txt file
and I don't notice any negative side-effects.
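For anyone following along, the workaround looks roughly like the fragment
below. I am assuming the line in question is the repeated-path-segment rule
from the default conf/regex-urlfilter.txt; the exact pattern and comment in
your copy may differ, so treat this as a sketch rather than a patch.

```
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
# (rule disabled as a workaround for NUTCH-233; assumed default pattern below)
# -.*(/.+?)/.*?\1/.*?\1/
```

In this file a leading '#' marks a comment and a leading '-' marks an exclude
rule, so prefixing the rule with '#' disables it entirely.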
----- Original Message ----
From: Dennis Kubes <[EMAIL PROTECTED]>
To: nutch-dev@lucene.apache.org
Sent: Wednesday, March 7, 2007 11:43:30 PM
Subject: Re: 0.9 release
> Dennis Kubes wrote:
>> I was looking through the JIRA to try and help create a list for this
>> release and to say the least it is a little overwhelming. It looks
>> like there are 183 issues total with 152 being unassigned. What has
>> been the current process for testing/committing issues that have
>> patches attached?
>
> Well, it was a bit haphazard .. :| mostly due to the fact that there
> were too few people to review and commit the patches on a timely basis.
> It's clear that we won't be able to close all 183 issues now. I think we
> must address blocker or critical issues, and we should address issues
> with a lot of votes.
I agree that we can't close all 183 issues now. That is a longer-term goal.
It may take a while to get through all of them but I will do my best to
start reviewing and testing these patches and issues one by one starting
with major and working on down (I think we will have handled all blocker
and critical with this release).
I am assuming it is best to post results to the dev list to get some
consensus and then, depending on that, either commit or close. I might need
a little help working out the process.
Dennis Kubes
>
>> I know that Andrzej said he had a list of patches for upgrading to
>> Hadoop 11.2. I know this includes NUTCH-437 among others.
>
> Done now.
>
>>
>> Then there are the issues discussed previously:
>>
>> NUTCH-400 blocker
>
> Already closed.
>
>> NUTCH-353 blocker
>
> There is no obvious fix for the remaining part of this issue - it's
> complicated ... I suggest to move this to Major, and go back to this
> issue after the release
>
>> NUTCH-233 blocker
>
> We should apply the fix, test, and if it works commit it. Any takers?
I will test this and report back. I am going to test it with the new
Hadoop version that you already committed.
>
>> NUTCH-436 critical
>
> Your patch looks good to me.
Should I go ahead and commit this or do we want to wait for release? Does
someone else need to test it?
>
>> NUTCH-381 critical
>
> We need a reproducible case in order to diagnose it.
Should I go ahead and close this one since we can't reproduce?
>
>> NUTCH-277 critical
>
> Unable to reproduce.
Close this one as well?
>
>> NUTCH-167 critical
>
> I will apply this before the release.
>
>>
>> NUTCH-427 (critical) suggest changing it from critical to major or even
>> trivial and then testing this out with the others in the jira for
>> later releases.
>
> Done.
>
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
>
>
>