GSearch 2.4.1, released today, now includes improved control of the
writeLimit in Apache Tika. See
https://wiki.duraspace.org/display/FCSVCS/Generic+Search+Service+2.4.1
-Gert
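
For readers unfamiliar with the write limit discussed in this thread, here is a minimal sketch of the idea. It is not GSearch or Tika code: the class and method names (WriteLimitSketch, LimitedWriter, fits) are hypothetical, and it uses only the JDK. In Tika itself the limit is passed to the BodyContentHandler(int writeLimit) constructor, where -1 disables it and the default is 100000 characters. How GSearch 2.4.1 exposes that setting is not shown here.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Arrays;

/** Illustrative stand-in for a text sink with a write limit; not Tika or GSearch code. */
final class WriteLimitSketch {

    /** Thrown when more characters arrive than the limit allows. */
    static final class WriteLimitReachedException extends IOException {
        WriteLimitReachedException(int limit) {
            super("write limit of " + limit + " characters reached");
        }
    }

    /** Collects characters until the limit is hit; a negative limit means unlimited. */
    static final class LimitedWriter extends Writer {
        private final StringWriter out = new StringWriter();
        private final int limit;
        private int written = 0;

        LimitedWriter(int limit) {
            this.limit = limit;
        }

        @Override
        public void write(char[] buf, int off, int len) throws IOException {
            if (limit < 0 || written + len <= limit) {
                out.write(buf, off, len);
                written += len;
                return;
            }
            // Keep what still fits, then signal that the rest was dropped.
            int room = limit - written;
            out.write(buf, off, room);
            written = limit;
            throw new WriteLimitReachedException(limit);
        }

        @Override
        public void flush() {
        }

        @Override
        public void close() {
        }

        @Override
        public String toString() {
            return out.toString();
        }
    }

    /** Returns true if the whole text fits under the limit. */
    static boolean fits(String text, int limit) {
        try (LimitedWriter w = new LimitedWriter(limit)) {
            w.write(text);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        char[] chars = new char[150000];
        Arrays.fill(chars, 'x');
        String longDoc = new String(chars);

        System.out.println(fits(longDoc, 100000)); // false: the limit is exceeded
        System.out.println(fits(longDoc, -1));     // true: no limit applied
    }
}
```

With a limit of 100000, a 150000-character document triggers the exception, which matches the symptom reported below for the longest PDFs. With -1, the same document passes through in full.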
Begin forwarded message:
> From: Gert Schmeltz Pedersen <gerts...@gmail.com>
> Date: 22. jan 2012 12.28.05 CET
> To: "Support and info exchange list for Fedora users."
> <fedora-commons-users@lists.sourceforge.net>
> Subject: Re: [fcrepo-user] Fedora Generic Search 2.4 test: some PDF documents
> not indexed full text
>
> PDFBox can extract text from PDF files only. Before Tika was included in
> GSearch 2.4, GSearch could not extract text from the other types. Improved
> control of the writeLimit in Tika will be released in GSearch during the
> coming week; all the types will then be available without length restriction.
>
> -Gert
>
>
> On 20/01/2012, at 23.38, Serhiy Polyakov wrote:
>
>> I am using GSearch 2.4. If I still want to full-text index very large
>> documents, I understand I can switch from Tika back to PDFBox in the
>> configuration (getDatastreamFromTika -> getDatastreamText). I also
>> want to full-text index MS Word, Excel, PowerPoint and other types.
>> Which software component will actually do the extraction from
>> those file types if I switch to PDFBox?
>>
>> Thanks,
>> Serhiy
>>
>>
>> On Mon, Jan 16, 2012 at 11:31 AM, Gert Schmeltz Pedersen
>> <gerts...@gmail.com> wrote:
>>> Re (1) I ingested the ERIC document and found that with
>>> getDatastreamFromTika I get an exception, because the document has more
>>> than 100000 characters, while with PDFBox directly (getDatastreamText) the
>>> document gets indexed. This probably explains why some of your documents,
>>> the longest ones, are not indexed. I will investigate how we can raise
>>> that character limit in Tika.
>>>
>>> Re (2) The ERROR in fedora.log appears because your indexing stylesheet
>>> tries to index a datastream that is not present in that object. You can
>>> ignore the message or, preferably, change your indexing stylesheet so that
>>> it only tries to index datastreams that are known to exist.
>>>
>>> -Gert
>>>
>>> On 16/01/2012, at 10.16, Serhiy Polyakov wrote:
>>>
>>>> I tested Fedora Generic Search 2.4
>>>>
>>>> (1)
>>>> The focus was on PDF full-text indexing. I found that some PDF documents
>>>> are full-text indexed OK but some are not. Those that are not full-text
>>>> indexed can be converted into text using Adobe Acrobat, so they are
>>>> not images. Their metadata is indexed all right in Fedora.
>>>>
>>>> An example of a document that was not full-text indexed is from the ERIC
>>>> database:
>>>>
>>>> "Digest of Education Statistics, 2009. NCES 2010-013"
>>>>
>>>> http://www.eric.ed.gov/ERICWebPortal/search/recordDetails.jsp?ERICExtSearch_SearchValue_0=ED509883&searchtype=keyword&ERICExtSearch_SearchType_0=no&_pageLabel=RecordDetails&accno=ED509883&_nfls=false&source=ae
>>>>
>>>> I looked at the fedoragsearch.daily.log and see that fields like
>>>> <field name="dsmd_OBJ.Content-Type"> are there for the problem PDF
>>>> document. However, a field like <field name="ds.OBJ"> is absent.
>>>>
>>>> For other PDF documents that were full-text indexed without problems,
>>>> the field <field name="ds.OBJ"> was in the fedoragsearch.daily.log.
>>>>
>>>> Any suggestion on how to fix this would help.
>>>>
>>>>
>>>> (2)
>>>> Additionally, for each ingest of any object, multiple records starting
>>>> with the following are written to fedora.log:
>>>>
>>>> ERROR 2012-01-16 02:45:43.124 [http-8080-4]
>>>> (FedoraAPIABindingSOAPHTTPImpl) Error getting datastream dissemination
>>>> org.fcrepo.server.errors.DatastreamNotFoundException: [DefaulAccess]
>>>> No datastream could be returned. Either there is no datastream for the
>>>> digital object "mynamesp:someid" with datastream ID of "QUERY " OR
>>>> there are no datastreams that match the specified date/time value of
>>>> "null " .
>>>> ...
>>>> ...
>>>>
>>>> "mynamesp:someid" is my collection where I ingest objects.
>>>>
>>>> Should I ignore those?
>>>>
>>>>
>>>> Thank you,
>>>> Serhiy
>>>>
>>>> _______________________________________________
>>>> Fedora-commons-users mailing list
>>>> Fedora-commons-users@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/fedora-commons-users