Hello,

I looked at this issue this week and I think I have identified the problem. I plan to prepare a patch for it, but first I have to find out the correct way of submitting a patch for nutch. I will describe the problem briefly here so you can fix it quickly yourself, and I will try to do it formally later.
This is a multithreading problem connected with the use of the jakarta-oro library. The code is located in the BasicUrlNormalizer class.
One object of this class is shared by multiple fetcher threads.
From the oro Javadoc:


"Perl5Compiler and Perl5Matcher are designed with the intent that you use a separate instance of each per thread to avoid the overhead of both synchronization and concurrent access (e.g., a match that takes a long time in one thread will block the progress of another thread with a shorter match). If you want to use a single instance of each in a concurrent program, you must appropriately protect access to the instances with critical sections. If you want to share Perl5Pattern instances between concurrently executing instances of Perl5Matcher, you must compile the patterns with READ_ONLY_MASK."

To eliminate the bug you have to change the constructor and the substituteUnnecessaryRelativePaths method:
1) Constructor: change

    relativePathRule.pattern = (Perl5Pattern)
      compiler.compile("(/[^/]*[^/.]{1}[^/]*/\\.\\./)");

into:

    relativePathRule.pattern = (Perl5Pattern)
      compiler.compile("(/[^/]*[^/.]{1}[^/]*/\\.\\./)", Perl5Compiler.READ_ONLY_MASK);

and

    leadingRelativePathRule.pattern = (Perl5Pattern)
      compiler.compile("^(/\\.\\./)+");

into:

    leadingRelativePathRule.pattern = (Perl5Pattern)
      compiler.compile("^(/\\.\\./)+", Perl5Compiler.READ_ONLY_MASK);

2) substituteUnnecessaryRelativePaths(): add a local matcher object (you can then remove the unused matcher field). Change


    oldLen = fileWorkCopy.length();
    fileWorkCopy = Util.substitute
      (matcher, relativePathRule.pattern,
       relativePathRule.substitution, fileWorkCopy, 1);

into:

    oldLen = fileWorkCopy.length();
    Perl5Matcher matcher = new Perl5Matcher();
    fileWorkCopy = Util.substitute
      (matcher, relativePathRule.pattern,
       relativePathRule.substitution, fileWorkCopy, 1);

This eliminated the majority of the ArrayIndexOutOfBoundsException errors in my case. A few still occur during parsing in NekoHTML, but they happen less often and I will have a look at them later.
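
For readers who want to try the idea without the oro jar, here is a minimal sketch of the same arrangement (patterns compiled once and shared, matcher created locally on each call). It uses java.util.regex as a stand-in for jakarta-oro, so it is not the nutch code itself. The class name RelativePathNormalizerSketch is made up for illustration, and the loop is my own simplified reading of the substitution step, not a copy of the original method.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the relevant part of BasicUrlNormalizer.
// java.util.regex replaces jakarta-oro here (an assumption made so the
// sketch runs on a plain JDK).
public class RelativePathNormalizerSketch {

    // Compiled once and shared by all threads. java.util.regex.Pattern is
    // immutable, which plays the role of READ_ONLY_MASK in oro.
    private static final Pattern RELATIVE_PATH =
        Pattern.compile("(/[^/]*[^/.]{1}[^/]*/\\.\\./)");
    private static final Pattern LEADING_RELATIVE_PATH =
        Pattern.compile("^(/\\.\\./)+");

    public static String substituteUnnecessaryRelativePaths(String file) {
        String work = file;
        int oldLen;
        do {
            oldLen = work.length();
            // The matcher is a local variable, so no two threads ever
            // share its mutable match state.
            Matcher matcher = RELATIVE_PATH.matcher(work);
            // Replace only the first "/xx/../" with "/" per iteration.
            work = matcher.replaceFirst("/");
            // Replace a leading run of "/../" with a single "/".
            work = LEADING_RELATIVE_PATH.matcher(work).replaceFirst("/");
        } while (work.length() != oldLen); // stop once nothing changed
        return work;
    }

    public static void main(String[] args) {
        System.out.println(
            substituteUnnecessaryRelativePaths("/aa/bb/../../cc/../foo.html"));
    }
}
```

Since every call builds its own matcher, the method needs no synchronized keyword even when many fetcher threads call it at once.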

The same oro usage pattern also appears in other parts of nutch, but there the code is either synchronized (as far as I can tell, oro usage is the only reason for the synchronization, so it can probably be removed and the oro calls adapted for multithreaded use) or not used from multiple threads.
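
To illustrate what replacing the synchronization could look like, here is a rough sketch that contrasts a lock-protected shared matcher with one matcher per thread. Again java.util.regex stands in for jakarta-oro. The class name, the method names and the duplicate-slash pattern are invented for the example and do not come from nutch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example: two ways to use a regex matcher from many threads.
public class PerThreadMatcherSketch {

    // Illustrative pattern only: a run of two or more slashes.
    private static final Pattern DUPLICATE_SLASHES = Pattern.compile("/{2,}");

    // Variant 1: a single shared matcher. Its state is mutable, so every
    // call has to hold the class lock, and threads wait for each other.
    private static final Matcher SHARED = DUPLICATE_SLASHES.matcher("");

    public static synchronized String collapseLocked(String path) {
        return SHARED.reset(path).replaceAll("/");
    }

    // Variant 2: one matcher per thread, created lazily on first use.
    // No lock is needed because no matcher is ever shared.
    private static final ThreadLocal<Matcher> PER_THREAD =
        ThreadLocal.withInitial(() -> DUPLICATE_SLASHES.matcher(""));

    public static String collapsePerThread(String path) {
        return PER_THREAD.get().reset(path).replaceAll("/");
    }

    public static void main(String[] args) {
        System.out.println(collapseLocked("/a//b///c"));
        System.out.println(collapsePerThread("/a//b///c"));
    }
}
```

Both variants return the same result. The second one avoids the situation the oro Javadoc warns about, where a long match in one thread blocks a short match in another.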

Regards
Piotr Kosiorowski


John X wrote:

On Fri, Jan 07, 2005 at 05:27:36PM -0800, Tim England wrote:


I'm trying to track down a bunch of exceptions thrown during the fetch
phase.  For example:

fetch of http://www.zdnet.com/ failed with:
java.lang.ArrayIndexOutOfBoundsException: 38



It may well be that some code is not thread safe. I identified it a while ago. I guess I have to commit the patch using Doug's suggestion. I will try to have that fixed over the weekend. Meanwhile you may want to search the list archive. The thread was probably before Xmas.

John



Before I try to track this down myself by digging through Fetcher.java etc,
is there a simple config tweak I need to make?  I've only been using the
nightly builds, so perhaps I should just start using the 0.5 release build.

Comments highly appreciated...

Tim England





-------------------------------------------------------
The SF.Net email is sponsored by: Beat the post-holiday blues
Get a FREE limited edition SourceForge.net t-shirt from ThinkGeek.
It's fun and FREE -- well, almost....http://www.thinkgeek.com/sfshirt
_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers




