Good to know that it can be made to work. ;-) We should probably look at Lucene/Solr 3.1, which was just released and is the next Solr version after 1.4.1, to see whether anything special is needed there.
Karl

On Mon, Apr 4, 2011 at 11:01 AM, Erlend Garåsen <[email protected]> wrote:
>
> After I downloaded and replaced the following jars, I no longer have a
> character encoding problem:
> pdfbox-1.5.0.jar
> fontbox-1.5.0.jar
> jempbox-1.5.0.jar
>
> Erlend
>
> On 31.03.11 14.35, Karl Wright wrote:
>>
>> It might be worth cross-posting this to the Tika user or dev list.
>> Jukka Zitting is one of the principal Tika developers and he's also a
>> committer for MCF, but I'm not sure he'll notice it go by otherwise.
>>
>> In case you're wondering how to update the MCF FAQ, it's in the Wiki
>> so all you need to do is sign up and you'll be able to update it.
>> https://cwiki.apache.org/confluence/display/CONNECTORS/FAQ
>>
>> Karl
>>
>> On Thu, Mar 31, 2011 at 6:59 AM, Erlend Garåsen <[email protected]> wrote:
>>>
>>> Oh, there's more unfortunately. Some of the Tika dependencies need to be
>>> further updated. I couldn't parse the date from PDF documents correctly.
>>> I'm not quite sure which of the extracting libraries causing this problem
>>> (probably pdfbox).
>>> Anyway, I can now extract contents from the following
>>> document formats without any problems:
>>> - HTML
>>> - RTF
>>> - DOC
>>> - DOCX
>>> - ODT
>>> - XLSX
>>> - XLS
>>> - SXW
>>> - PDF
>>>
>>> I'm using the following jars:
>>> apache-solr-cell-1.4.2-dev.jar
>>> geronimo-stax-api_1.0_spec-1.0.1.jar
>>> poi-scratchpad-3.7.jar
>>> asm-3.1.jar
>>> icu4j-4_6.jar
>>> rome-0.9.jar
>>> bcmail-jdk15-1.45.jar
>>> jempbox-1.3.1.jar
>>> tagsoup-1.2.jar
>>> bcprov-jdk15-1.45.jar
>>> metadata-extractor-2.4.0-beta-1.jar
>>> tika-core-0.8.jar
>>> boilerpipe-1.1.0.jar
>>> netcdf-4.2.jar
>>> tika-parsers-0.8.jar
>>> commons-compress-1.1.jar
>>> pdfbox-1.3.1.jar
>>> commons-logging-1.1.1.jar
>>> poi-3.7.jar
>>> xercesImpl-2.8.1.jar
>>> dom4j-1.6.1.jar
>>> poi-ooxml-3.7.jar
>>> xml-apis-1.0.b2.jar
>>> fontbox-1.3.1.jar
>>> poi-ooxml-schemas-3.7.jar
>>> xmlbeans-2.3.0.jar
>>>
>>> But I still have some problems with PDF documents[1]. I'm not sure whether
>>> it is a pdfbox bug, but Norwegian characters like æ, ø and å cannot be
>>> displayed correctly after Solr has indexed the document. The characters are
>>> replaced by a question mark.
>>>
>>> [1] http://ridder.uio.no/dokument.pdf
>>>
>>> Erlend
>>>
>>> On 30.03.11 18.09, Karl Wright wrote:
>>>>
>>>> Certainly it makes sense to start with the FAQ, especially for places
>>>> where you are tripping over known bugs. We can always do a site page
>>>> later.
>>>>
>>>> Thanks!
>>>> Karl
>>>>
>>>> On Wed, Mar 30, 2011 at 12:07 PM, Erlend Garåsen <[email protected]> wrote:
>>>>>
>>>>> On 30.03.11 18.00, Karl Wright wrote:
>>>>>>
>>>>>> It would be great if this information went at least into the FAQ, and
>>>>>> even better if we added a page to the site documentation. I'm
>>>>>> thinking maybe a whole page titled "Integrating with Solr", which
>>>>>> would walk you through the process and the pitfalls. What do you
>>>>>> think?
>>>>>
>>>>> Yes, I think so.
>>>>>
>>>>> The next version of Solr will probably be released soon, and then it will
>>>>> be much easier to integrate Solr. Maybe it is sufficient to add the
>>>>> information into the FAQ since the problem mentioned only affects 1.4.1?
>>>>>
>>>>> Erlend
>>>>>
>>>>> --
>>>>> Erlend Garåsen
>>>>> Center for Information Technology Services
>>>>> University of Oslo
>>>>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>>>>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
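
The thread reports that the PDF character encoding problem seen with the 1.3.1 jars went away once pdfbox, fontbox and jempbox were replaced with 1.5.0. A minimal sketch of how one might flag older copies of those three jars in a list of file names follows. The function names, the file-name pattern and the choice of 1.5.0 as a threshold are assumptions made for illustration; nothing here is part of MCF, Solr or Tika, and the thread does not establish 1.5.0 as a strict minimum.

```python
import re

# Assumed naming scheme: "<artifact>-<version>.jar", version starting with a digit.
JAR_PATTERN = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.\-]*)\.jar$")

# The three jars that were replaced together in the thread.
PDF_ARTIFACTS = ("pdfbox", "fontbox", "jempbox")

# Assumption: treat the version reported as working (1.5.0) as the threshold.
MINIMUM_VERSION = (1, 5, 0)


def parse_jar(filename):
    """Return (artifact, version), or None if the name does not fit the pattern."""
    match = JAR_PATTERN.match(filename)
    if match is None:
        return None
    return match.group("artifact"), match.group("version")


def version_tuple(version):
    """Turn the leading dotted digits of a version into a 3-part (or longer) tuple."""
    leading = re.match(r"\d+(?:\.\d+)*", version).group()
    parts = [int(part) for part in leading.split(".")]
    parts += [0] * (3 - len(parts))  # pad "1.5" to (1, 5, 0)
    return tuple(parts)


def outdated_pdf_jars(filenames, minimum=MINIMUM_VERSION):
    """List the PDF-related jars whose version is below the given threshold."""
    outdated = []
    for filename in filenames:
        parsed = parse_jar(filename)
        if parsed is None:
            continue
        artifact, version = parsed
        if artifact in PDF_ARTIFACTS and version_tuple(version) < minimum:
            outdated.append(filename)
    return outdated


if __name__ == "__main__":
    jars = ["pdfbox-1.3.1.jar", "fontbox-1.5.0.jar", "tika-core-0.8.jar"]
    for name in outdated_pdf_jars(jars):
        print("consider updating:", name)
```

In practice the list of names could come from something like os.listdir() on the Solr lib directory. Jars outside the three named artifacts are ignored on purpose, since the thread only ties the encoding fix to those.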
