Have you ever used UIMA? It is the same software used on the IBM Watson
project. Very, very powerful.

http://uima.apache.org/

Dan

On Oct 24, 2012, at 10:45 PM, Hilmar Lapp <[email protected]> wrote:

> The code is a very small snippet from natural language processing software 
> aimed at extracting structured phenotype descriptions from un- or 
> semistructured free text. Apparently the code as is (in Perl) makes a lot of 
> regular expression matches, and so if the speed difference for them between 
> Perl and Java is significant, in theory this might become a problem. Though 
> whether it will or will not amount to a bottleneck indeed remains to be seen, 
> as the code is also doing other things that are potentially expensive, and 
> possibly more so than the regex matching. 
> 
> So the exercise here is merely to see whether there is a notable performance 
> difference in regex pattern evaluation that can't simply be attributed to 
> programming mistakes (and apparently there is).
> 
>    -hilmar
> 
> On Oct 24, 2012, at 2:30 PM, P. Troshin wrote:
> 
>> Hi Hilmar,
>> 
>> I looked at the test in a bit more detail. I can see what you are
>> trying to test, but is there a real-life problem behind this?
>> What this test is doing is a lot of searches on very short strings. Is
>> this what your real-life application does? I am asking because if your
>> real-life application uses regexps to look into long strings, the
>> performance might be totally different.
>> What is your aim? 3 seconds for 500K searches does not seem
>> particularly slow to me.
>> 
>> Thanks
>> Peter
>> 
>> 
>> On 24 October 2012 19:10, P. Troshin <[email protected]> wrote:
>>> Hi Hilmar,
>>> 
>>> Hmm, it looks like I spoke too soon; the previous run was doing
>>> nothing as all of the cases were commented out.
>>> I can now see that the results of my runs are not massively different
>>> from yours.
>>> It would help if you could encourage your student to write a few unit
>>> tests so that we know what you are trying to achieve and to simplify
>>> the testing.
>>> 
>>> Just a thought
>>> 
>>> Thanks,
>>> Peter
>>> 
>>> 
>>> 
>>> On 24 October 2012 17:47, Hilmar Lapp <[email protected]> wrote:
>>>> Hi everyone,
>>>> 
>>>> Thanks for all your responses. Indeed I know that the Java regex API isn't 
>>>> an enjoyable one to program with, and if the underlying task were about 
>>>> writing something from scratch, I'd be all for avoiding regex's too if the 
>>>> same thing could be achieved by string comparison.
>>>> 
>>>> However, and of course I failed to say that initially, the task from which 
>>>> this query is originating is about converting a Perl script to Java (not 
>>>> because Perl is somehow bad, but because those Perl scripts have proven to 
>>>> be an obstacle to easy cross-platform installation of the - mostly Java - 
>>>> software they are a part of). That doesn't mean one couldn't in the process 
>>>> also rewrite the code that uses regular expressions to one that doesn't, 
>>>> but I also think it wise not to introduce multiple variables as a source 
>>>> of error at once.
>>>> 
>>>> Some of the responses would be best answered by looking at the expressions 
>>>> and the code that uses them, so here are the two "benchmark" scripts.
>>>> 
>>>> Java: https://gist.github.com/3940931
>>>> Perl: https://gist.github.com/3940780
>>>> 
>>>> I'm also copying Dongye Meng here, who is a CS student at UNC working with 
>>>> us on the project - if anyone has further wisdom to share about how to 
>>>> reduce the performance gap between the two versions, he'd surely 
>>>> appreciate it.
>>>> 
>>>>       -hilmar
>>>> 
>>>> On Oct 23, 2012, at 6:42 AM, Phillip Lord wrote:
>>>> 
>>>>> Hilmar Lapp <[email protected]> writes:
>>>>>> They (at least as in java.util.regex) have been reported to me as
>>>>>> performing much slower (by several orders of magnitude) than the regex
>>>>>> implementation in Perl, and some simple benchmarking tests seem to
>>>>>> bear that out. Even after scrutinizing the benchmark and finding
>>>>>> nothing obvious, I'm still skeptical as to why this would be the case
>>>>>> - naively I would have assumed that the underlying runtime library is
>>>>>> implemented in C in both cases. But perhaps this is not true?
>>>>> 
>>>>> 
>>>>> Well, the difference is that Perl is perl, while Java is not; it all
>>>>> depends on the JVM, and libraries also. A quick shuftie at
>>>>> the source for the open-jdk libraries suggests that the regexp searching
>>>>> is done in Java -- it's not just a drop through to C. Always the problem
>>>>> with performance optimisation on Java -- you are only optimising for one
>>>>> situation. It might be interesting to see how much variation there is
>>>>> between JVMs.
>>>>> 
>>>>> Like others, I would only use regexp as a last resort in Java anyway;
>>>>> compared to Perl, writing the code is painful. Still, I guess that you
>>>>> know this!
>>>>> 
>>>>> Phil
>>>> 
>>>> --
>>>> ===========================================================
>>>> : Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
>>>> ===========================================================
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> _______________________________________________
>>>> Biojava-l mailing list  -  [email protected]
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
> 

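For readers following the thread: the kind of measurement being discussed (many java.util.regex searches against very short strings) might be sketched roughly as below. This is not the code from the gists linked above. The pattern, the input strings, and the class and method names are all made up for illustration, and the iteration count only echoes the 500K figure Peter mentions. The sketch also contrasts reusing one compiled Pattern with recompiling it on every search, since recompiling is one of the programming mistakes a benchmark like this would want to rule out.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexTimingSketch {
    // Hypothetical pattern and inputs; NOT taken from the gists in this thread.
    static final String REGEX = "\\b(\\d+)\\s*(mm|cm)\\b";
    static final String[] INPUTS = {
        "leaf 3 mm wide", "petals absent", "stem 12 cm long", "flowers white"
    };

    // Compile the pattern once and reuse it for every search.
    static int countPrecompiled(int iterations) {
        Pattern pattern = Pattern.compile(REGEX);
        int hits = 0;
        for (int i = 0; i < iterations; i++) {
            Matcher m = pattern.matcher(INPUTS[i % INPUTS.length]);
            if (m.find()) {
                hits++;
            }
        }
        return hits;
    }

    // Recompile the pattern for every search (the pattern to avoid).
    static int countRecompiled(int iterations) {
        int hits = 0;
        for (int i = 0; i < iterations; i++) {
            Matcher m = Pattern.compile(REGEX).matcher(INPUTS[i % INPUTS.length]);
            if (m.find()) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int n = 500_000;

        // Warm-up runs so the JIT has a chance to compile the hot loops.
        countPrecompiled(n);
        countRecompiled(n);

        long t0 = System.nanoTime();
        int a = countPrecompiled(n);
        long t1 = System.nanoTime();
        int b = countRecompiled(n);
        long t2 = System.nanoTime();

        System.out.printf("precompiled: %d hits in %d ms%n", a, (t1 - t0) / 1_000_000);
        System.out.printf("recompiled:  %d hits in %d ms%n", b, (t2 - t1) / 1_000_000);
    }
}
```

Timings from a loop like this depend heavily on the JVM, the warm-up, and the patterns used, so they are at best a rough indication and say little about a real application that searches longer strings.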