Have you ever used UIMA? It's the same software used on the IBM Watson project. Very powerful.

http://uima.apache.org/

Dan

Sent from my iPhone

On Oct 24, 2012, at 10:45 PM, Hilmar Lapp <[email protected]> wrote:

> The code is a very small snippet from a natural language processing software
> aimed at extracting structured phenotype descriptions from un- or
> semistructured free text. Apparently the code as is (in Perl) makes a lot of
> regular expression matches, and so if the speed difference for them between
> Perl and Java is significant, in theory this might become a problem. Though
> whether it will or will not amount to a bottleneck indeed remains to be seen,
> as the code is also doing other things that are potentially expensive, and
> possibly more so than the regex matching.
>
> So the exercise here is merely to see whether there is a notable performance
> difference in regex pattern evaluation that can't simply be attributed to
> programming mistakes (and apparently there is).
>
> -hilmar
>
> On Oct 24, 2012, at 2:30 PM, P. Troshin wrote:
>
>> Hi Hilmar,
>>
>> Looked at the test in a bit more detail. I can see what you are
>> trying to test, but is there a real-life problem behind this?
>> What this test is doing is a lot of searches on very short strings. Is
>> this what your real-life application does? I am asking because if your
>> real-life application uses regexps to look into long strings, the
>> performance might be totally different.
>> What is your aim? 3 seconds for 500K searches does not seem
>> particularly slow to me.
>>
>> Thanks
>> Peter
>>
>>
>> On 24 October 2012 19:10, P. Troshin <[email protected]> wrote:
>>> Hi Hilmar,
>>>
>>> Hmm, it looks like I spoke too soon; the previous run was doing
>>> nothing as all of the cases were commented out.
>>> I can now see that the results of my runs are not massively different
>>> from yours.
>>> It would help if you could encourage your student to write a few unit
>>> tests so that we know what you are trying to achieve and to simplify
>>> the testing.
>>>
>>> Just a thought
>>>
>>> Thanks,
>>> Peter
>>>
>>>
>>>
>>> On 24 October 2012 17:47, Hilmar Lapp <[email protected]> wrote:
>>>> Hi everyone,
>>>>
>>>> Thanks for all your responses. Indeed I know that the Java regex API isn't
>>>> an enjoyable one to program with, and if the underlying task were about
>>>> writing something from scratch, I'd be all for avoiding regexes too if the
>>>> same thing could be achieved by string comparison.
>>>>
>>>> However, and of course I failed to say that initially, the task from which
>>>> this query is originating is about converting a Perl script to Java (not
>>>> because Perl is somehow bad, but because those Perl scripts have proven to
>>>> be an obstacle to easy cross-platform installation of the - mostly Java -
>>>> software they are a part of). That doesn't mean one couldn't in the course
>>>> also rewrite the code that uses regular expressions to one that doesn't,
>>>> but I also think it wise not to introduce multiple variables as a source
>>>> of error at once.
>>>>
>>>> Some of the responses would be best answered by looking at the expressions
>>>> and the code that uses them, so here are the two "benchmark" scripts.
>>>>
>>>> Java: https://gist.github.com/3940931
>>>> Perl: https://gist.github.com/3940780
>>>>
>>>> I'm also copying Dongye Meng here, who is a CS student at UNC working with
>>>> us on the project - if anyone has further wisdom to share about how to
>>>> reduce the performance gap between the two versions, he'd surely
>>>> appreciate it.
>>>>
>>>> -hilmar
>>>>
>>>> On Oct 23, 2012, at 6:42 AM, Phillip Lord wrote:
>>>>
>>>>> Hilmar Lapp <[email protected]> writes:
>>>>>> They (at least as in java.util.regex) have been reported to me as
>>>>>> performing much slower (by several orders of magnitude) than the regex
>>>>>> implementation in Perl, and some simple benchmarking tests seem to
>>>>>> bear that out.
>>>>>> Even after scrutinizing the benchmark and finding
>>>>>> nothing obvious, I'm still skeptical as to why this would be the case
>>>>>> - naively I would have assumed that the underlying runtime library is
>>>>>> implemented in C in both cases. But perhaps this is not true?
>>>>>
>>>>>
>>>>> Well, the difference is that Perl is perl, while Java is not; it all
>>>>> depends on the JVM, and libraries also. A quick shuftie at
>>>>> the source for the open-jdk libraries suggests that the regexp searching
>>>>> is done in Java -- it's not just a drop through to C. Always the problem
>>>>> with performance optimisation on Java -- you are only optimising for one
>>>>> situation. It might be interesting to see how much variation there is
>>>>> between JVMs.
>>>>>
>>>>> Like others, I would only use regexp as a last resort in Java anyway;
>>>>> compared to Perl, writing the code is painful. Still, I guess that you
>>>>> know this!
>>>>>
>>>>> Phil
>>>>
>>>> --
>>>> ===========================================================
>>>> : Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
>>>> ===========================================================
>>>>
>>>> _______________________________________________
>>>> Biojava-l mailing list - [email protected]
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
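
For anyone wanting to reproduce the kind of comparison discussed in this thread (many regex searches on very short strings), here is a minimal Java sketch. It is not the code from the gists: the class name RegexTiming, the pattern, and the input strings are all made up for illustration. It times 500K searches twice, once reusing a precompiled Pattern and once recompiling the pattern for every input, after a warm-up phase so the JIT has had a chance to run. Recompiling is only one programming mistake worth ruling out; the sketch does not claim it explains the gap reported above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical micro-benchmark: the pattern and inputs are placeholders,
// not the expressions from the gists linked in the thread.
class RegexTiming {

    // Counts inputs containing a match, reusing one compiled Pattern and one Matcher.
    static int countMatches(Pattern pattern, String[] inputs) {
        Matcher matcher = pattern.matcher("");
        int hits = 0;
        for (String input : inputs) {
            if (matcher.reset(input).find()) {
                hits++;
            }
        }
        return hits;
    }

    // Same count, but compiling the pattern again for every input string.
    static int countMatchesRecompiling(String regex, String[] inputs) {
        int hits = 0;
        for (String input : inputs) {
            if (Pattern.compile(regex).matcher(input).find()) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String regex = "\\b(leaf|leaves|petal)s?\\b";
        String[] inputs = {"leaves ovate", "stem erect", "petals 5", "fruit a capsule"};
        Pattern pattern = Pattern.compile(regex);
        int rounds = 125_000; // 4 inputs x 125,000 rounds = 500K searches

        // Warm-up so the JIT compiler can optimise both paths before timing starts.
        long sink = 0;
        for (int i = 0; i < 20_000; i++) {
            sink += countMatches(pattern, inputs);
            sink += countMatchesRecompiling(regex, inputs);
        }

        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            sink += countMatches(pattern, inputs);
        }
        long precompiledMs = (System.nanoTime() - start) / 1_000_000;

        start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            sink += countMatchesRecompiling(regex, inputs);
        }
        long recompiledMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println("precompiled: " + precompiledMs + " ms");
        System.out.println("recompiled:  " + recompiledMs + " ms");
        // Printing the accumulated hit count keeps the loops from being optimised away.
        System.out.println("total hits: " + sink);
    }
}
```

Timings will vary with the JVM and the hardware, which is the point Phil raises about variation between JVMs, so the numbers are only meaningful relative to each other on the same machine. For anything beyond a rough check, a proper harness such as JMH would be a better choice than hand-rolled timing loops.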
