That version of tm-extractors is quite old.
There is a newer version on the Google site -
http://code.google.com/p/text-mining/ - but it will take a bit of work
wrapping things up for general use.
It depends on newer versions of POI than 0.4 does, and has some distinct
improvements to its robustness.
G
On 6 October 2010 16:39, Tim Donohue <tdono...@duraspace.org> wrote:
> Ugh -- sounds like you've entered dependency hell.
>
> Though, I think the one shred of good news here is that it seems to only
> have a dependency conflict in one place in our codebase.
>
> It looks like (at a glance) if our WordFilter can be re-written to no
> longer need the org.textmining project, you *might* be OK (i.e.
> hopefully it wouldn't snowball on you). But, that would require finding
> a Word document text extractor that is as good as (or better than) that
> 'org.textmining' one, and then hoping it doesn't cause another
> dependency conflict. Not sure of any alternative Word text extractors,
> off the top of my head, but maybe others know of one?
>
> - Tim
>
>
> On 10/6/2010 5:51 AM, Keith Gilbertson wrote:
> > Thanks, Tim. That helped me to understand. I put the version numbers of
> the dependencies in the parent pom.xml ('dspace-src/pom.xml') and left the
> version numbers out of 'dspace-src/dspace-api/pom.xml'.
> >
> > So then I found another thing I didn't look at closely enough. The
> WordFilter doesn't use poi directly, but the org.textmining project that it
> uses depends on that old version of POI. To confuse things more, the old
> versions of poi had groupId 'poi', and the new versions have groupId
> 'org.apache.poi'.
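> >
> > As far as I understand it (a sketch of my reading of Maven, not something
> > I have verified against the DSpace build), Maven identifies an artifact by
> > groupId plus artifactId. So it treats these two entries as unrelated
> > libraries, not as two versions of one library, and keeps both of them:
> >
> > ```xml
> > <!-- old coordinates: the POI that tm-extractors 0.4 depends on -->
> > <dependency>
> >   <groupId>poi</groupId>
> >   <artifactId>poi</artifactId>
> >   <version>2.5.1-final-20040804</version>
> > </dependency>
> >
> > <!-- new coordinates: the POI that the PowerPoint filter needs -->
> > <dependency>
> >   <groupId>org.apache.poi</groupId>
> >   <artifactId>poi</artifactId>
> >   <version>3.6</version>
> > </dependency>
> > ```
> >
> > That would explain why both poi-2.5.1-final-20040804.jar and poi-3.6.jar
> > ended up in the lib directory.
> >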
> > I can convince Maven to forget about the old version of the POI library
> by making this exclusion change in the parent pom:
> > <dependency>
> > <groupId>org.textmining</groupId>
> > <artifactId>tm-extractors</artifactId>
> > <version>0.4</version>
> > <exclusions>
> > <exclusion>
> > <groupId>poi</groupId>
> > <artifactId>poi</artifactId>
> > </exclusion>
> > </exclusions>
> > </dependency>
> >
> > Then only the new version, org.apache.poi/poi/3.6, is included in the
> project. Unfortunately, the org.textmining extractors really do need the
> old version of POI. The PowerPointFilter works, but I've broken the WordFilter:
> >
> > Exception: org.apache.poi.poifs.filesystem.POIFSFileSystem.getRoot()Lorg/apache/poi/poifs/filesystem/DirectoryEntry;
> > java.lang.NoSuchMethodError: org.apache.poi.poifs.filesystem.POIFSFileSystem.getRoot()Lorg/apache/poi/poifs/filesystem/DirectoryEntry;
> >     at org.textmining.text.extraction.WordExtractor.extractText(WordExtractor.java:51)
> >     at org.dspace.app.mediafilter.WordFilter.getDestinationStream(WordFilter.java:95)
> >
> > I have two programs that share the same classpath, but need different
> versions of the same library.
> >
> > I could rewrite the WordFilter so that it no longer uses the
> org.textmining package which needs the old library, but I keep thinking that
> the more I try to "fix" stuff, the more I'm likely to break:
> >
> >
> http://www.nypost.com/p/news/local/brooklyn/rat_bastards_f5onjzgcqxm0fu3RFz3ySL
> >
> >
> >
> > On Oct 5, 2010, at 4:09 PM, Tim Donohue wrote:
> >
> >> Hi Keith,
> >>
> >> Simply put, it's because you were accidentally looking in the wrong
> pom.xml :) There are many of them sprinkled through the DSpace codebase, and
> they all inherit many of their settings from one main pom.xml.
> >>
> >> So, you noticed that the 'dspace-api/pom.xml' file included a dependency
> for "poi". But, if you look closely, that dependency doesn't list
> a <version>. This is because, for DSpace, we manage all the versions of
> dependencies in one parent pom.xml (which is loaded via the <parent> tag
> within the dspace-api/pom.xml).
> >>
> >> Now, take a look at the [dspace-src]/pom.xml. This is the main parent
> pom.xml for DSpace (with an artifactId of 'dspace-parent'):
> >>
> >> http://scm.dspace.org/svn/repo/dspace/trunk/pom.xml
> >>
> >> This is the pom.xml which actually lists the versions of every
> dependency used by the various APIs of DSpace. If you search in this
> pom.xml, you'll find this entry:
> >>
> >> <dependency>
> >> <groupId>poi</groupId>
> >> <artifactId>poi</artifactId>
> >> <version>2.5.1-final-20040804</version>
> >> </dependency>
> >>
> >> That's where the 2.5.1 version is sneaking in. If you make your
> necessary changes to this pom.xml, everything should act as you expect it
> to. So, just undo your changes in 'dspace-src/dspace-api/pom.xml', and
> instead make those changes to 'dspace-src/pom.xml'
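> >>
> >> As a rough sketch only (it assumes the parent pom keeps its versions in a
> >> <dependencyManagement> section, and it reuses the 3.6 artifacts from your
> >> message), the split between the two files could look something like this:
> >>
> >> ```xml
> >> <!-- dspace-src/pom.xml (parent): each version is declared once, here -->
> >> <dependencyManagement>
> >>   <dependencies>
> >>     <dependency>
> >>       <groupId>org.apache.poi</groupId>
> >>       <artifactId>poi</artifactId>
> >>       <version>3.6</version>
> >>     </dependency>
> >>     <dependency>
> >>       <groupId>org.apache.poi</groupId>
> >>       <artifactId>poi-scratchpad</artifactId>
> >>       <version>3.6</version>
> >>     </dependency>
> >>     <dependency>
> >>       <groupId>org.apache.poi</groupId>
> >>       <artifactId>poi-ooxml</artifactId>
> >>       <version>3.6</version>
> >>     </dependency>
> >>   </dependencies>
> >> </dependencyManagement>
> >>
> >> <!-- dspace-src/dspace-api/pom.xml (child): no <version> elements,
> >>      the versions are inherited from the parent -->
> >> <dependencies>
> >>   <dependency>
> >>     <groupId>org.apache.poi</groupId>
> >>     <artifactId>poi</artifactId>
> >>   </dependency>
> >>   <dependency>
> >>     <groupId>org.apache.poi</groupId>
> >>     <artifactId>poi-scratchpad</artifactId>
> >>   </dependency>
> >>   <dependency>
> >>     <groupId>org.apache.poi</groupId>
> >>     <artifactId>poi-ooxml</artifactId>
> >>   </dependency>
> >> </dependencies>
> >> ```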
> >>
> >> I hope that helps!
> >>
> >> - Tim
> >>
> >> On 10/5/2010 2:36 PM, Keith Gilbertson wrote:
> >>> Hi,
> >>>
> >>> I've been experimenting with a Media Filter for text extraction from
> PowerPoint files. It's based on the Apache POI libraries, as was suggested
> by others in a previous thread.
> >>>
> >>> It uses the poi, poi-scratchpad, and poi-ooxml artifacts, in version
> 3.6, the latest release version from Apache. I haven't done much with
> Maven, and am not sure how to tell it which libraries I need.
> >>>
> >>> This bit was already in the dspace-api/pom.xml file:
> >>> -<dependency>
> >>> -<groupId>poi</groupId>
> >>> -<artifactId>poi</artifactId>
> >>> -</dependency>
> >>>
> >>>
> >>> I removed it, because I wanted the latest version of the libraries.
> Then, I added these dependencies to the bottom of the file:
> >>>
> >>> +<dependency>
> >>> +<groupId>org.apache.poi</groupId>
> >>> +<artifactId>poi</artifactId>
> >>> +<version>3.6</version>
> >>> +</dependency>
> >>> +<dependency>
> >>> +<groupId>org.apache.poi</groupId>
> >>> +<artifactId>poi-scratchpad</artifactId>
> >>> +<version>3.6</version>
> >>> +</dependency>
> >>> +<dependency>
> >>> +<groupId>org.apache.poi</groupId>
> >>> +<artifactId>poi-ooxml</artifactId>
> >>> +<version>3.6</version>
> >>> +</dependency>
> >>>
> >>> Somehow Maven magically found the correct versions of the dependencies,
> and everything built fine. When I deployed DSpace and looked in the lib
> directory, there were two versions of the main poi library there:
> >>>
> >>> poi-2.5.1-final-20040804.jar
> >>> poi-3.6.jar
> >>> poi-ooxml-3.6.jar
> >>> poi-ooxml-schemas-3.6.jar
> >>> poi-scratchpad-3.6.jar
> >>>
> >>> I couldn't figure out why the poi-2.5.1 version was still there, or
> find anything that actually used it. So, in the interest of doing some
> quick testing, I just deleted it.
> >>>
> >>> Can someone give a hand on how to do this properly? I'm trying to tell
> the build process to find and use only version 3.6 of poi.
> >>>
> >>> Thank you!
> >>> --keith
> >>>
> >>>
> >>>
> >>>
> ------------------------------------------------------------------------------
> >>> Beautiful is writing same markup. Internet Explorer 9 supports
> >>> standards for HTML5, CSS3, SVG 1.1, ECMAScript5, and DOM L2 & L3.
> >>> Spend less time writing and rewriting code and more time creating great
> >>> experiences on the web. Be a part of the beta today.
> >>> http://p.sf.net/sfu/beautyoftheweb
> >>> _______________________________________________
> >>> DSpace-tech mailing list
> >>> DSpace-tech@lists.sourceforge.net
> >>> https://lists.sourceforge.net/lists/listinfo/dspace-tech
> >