Hi,

If you look in launchpad/builder/src/main/bundles/list.xml you will find the Tika bundles (below), which will almost certainly export what you need to use Tika directly. Those bundles will be in the Maven repo. If you need a different version, just add another bundle. If that makes Jackrabbit unstable (unlikely), then embed it. OSGi is good like that ;).
You can also run Tika on the command line, which can be a good way of isolating its heap usage when dealing with poorly formed PDF files.

<bundle>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-core</artifactId>
  <version>1.0</version>
</bundle>
<bundle>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-bundle</artifactId>
  <version>1.0</version>
</bundle>

HTH
Ian

On 13 February 2013 02:56, Robert A. Decker <dec...@robdecker.com> wrote:
> Hi,
>
> We would like to use tika to extract raw text from pdfs, word docs, etc.
>
> I've found a maven dependency for jackrabbit-text-extractors:
> http://mvnrepository.com/artifact/org.apache.jackrabbit/jackrabbit-text-extractors/1.6.5
>
> I also see that there's a sling bundle
> org.apache.sling.jackrabbit-text-extractors
>
> If all we want to do is extract text from documents (and don't really care
> about lucene indexing) then should we use the text extractors directly as a
> maven dependency? Or should we use the sling bundle and use it as a service?
>
> One problem I'm having is that I can't find the sling bundle in my svn clone
> of the sling repo. Does the code live somewhere else? Also, I see that even
> without this bundle there are a couple of tika services available in sling.
>
> Can someone provide some tips on getting started?
>
> thanks,
> Rob
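
PS: on the command-line option, here is a rough sketch of what launching Tika in its own JVM could look like from Java, so that a badly formed PDF can only exhaust the child's heap and not the one Sling runs in. It assumes the standalone tika-app jar is available locally; the class name TikaCli, the jar path and the heap size are made up for illustration and are not part of Sling or Tika. The --text and --encoding options are the ones the Tika command-line app documents, but check them against the version you download.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: runs the standalone Tika app in a separate JVM
// with its own heap limit. Needs Java 9+ for InputStream.readAllBytes().
final class TikaCli {

    // Builds the command line, e.g.
    //   java -Xmx256m -jar tika-app.jar --encoding=UTF-8 --text broken.pdf
    static List<String> buildCommand(String tikaAppJar, String maxHeap, String document) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-Xmx" + maxHeap);       // heap cap applies to the child JVM only
        cmd.add("-jar");
        cmd.add(tikaAppJar);             // assumed path to a downloaded tika-app jar
        cmd.add("--encoding=UTF-8");     // ask Tika to write its output as UTF-8
        cmd.add("--text");               // plain text output
        cmd.add(document);
        return cmd;
    }

    // Starts the child JVM, collects its stdout as the extracted text and
    // fails if Tika exits with a non-zero status (for example on OutOfMemoryError).
    static String extractText(String tikaAppJar, String maxHeap, String document)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(tikaAppJar, maxHeap, document));
        pb.redirectError(ProcessBuilder.Redirect.INHERIT); // let Tika's warnings show up on our stderr
        Process process = pb.start();
        String text;
        try (InputStream in = process.getInputStream()) {
            text = new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
        int exit = process.waitFor();
        if (exit != 0) {
            throw new IOException("Tika exited with status " + exit + " for " + document);
        }
        return text;
    }
}
```

Calling TikaCli.extractText("tika-app.jar", "256m", "some.pdf") would then return the text, or throw if the child JVM dies, which the caller can treat as "skip this document" without the main process being affected.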