GSearch 2.4.1, released today, now includes improved control of the 
writeLimit in Apache Tika. See:

  https://wiki.duraspace.org/display/FCSVCS/Generic+Search+Service+2.4.1 

-Gert
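
For readers unfamiliar with the writeLimit discussed in this thread, here is a minimal sketch of how a character write limit behaves. It is not GSearch or Tika code: the names LimitedTextSink, WriteLimitReached and extract are hypothetical, and the only details taken from the thread and from Tika's documented convention are the default of 100000 characters and the use of -1 to disable the limit.

```python
class WriteLimitReached(Exception):
    """Raised when a sink is given more characters than it allows."""


class LimitedTextSink:
    """Collects extracted text and refuses input past write_limit characters.

    Hypothetical illustration only. A write_limit of -1 means "no limit",
    an assumption modelled on the convention Tika documents for its
    body content handler.
    """

    def __init__(self, write_limit=100_000):
        self.write_limit = write_limit
        self._parts = []
        self._count = 0

    def write(self, chunk):
        limited = self.write_limit >= 0
        if limited and self._count + len(chunk) > self.write_limit:
            # Keep the part that still fits, then signal that the limit was hit.
            room = self.write_limit - self._count
            self._parts.append(chunk[:room])
            self._count += room
            raise WriteLimitReached(
                f"more than {self.write_limit} characters of text"
            )
        self._parts.append(chunk)
        self._count += len(chunk)

    def text(self):
        return "".join(self._parts)


def extract(chunks, write_limit=100_000):
    """Feed text chunks (standing in for parser output) into a limited sink."""
    sink = LimitedTextSink(write_limit)
    for chunk in chunks:
        sink.write(chunk)
    return sink.text()
```

With the default limit, a document that yields more than 100000 characters raises the exception instead of returning text, which matches the symptom reported below: the longest documents are the ones left without full text. Passing write_limit=-1 lets the same input through.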


Begin forwarded message:

> From: Gert Schmeltz Pedersen <gerts...@gmail.com>
> Date: 22. jan 2012 12.28.05 CET
> To: "Support and info exchange list for Fedora users." 
> <fedora-commons-users@lists.sourceforge.net>
> Subject: Re: [fcrepo-user] Fedora Generic Search 2.4 test: some PDF documents 
> not indexed full text
> 
> PDFBox can extract text from PDF files only. Before Tika was included in 
> GSearch 2.4, GSearch could not extract text from the other types. Improved 
> control of the writeLimit in Tika will be released in GSearch during the 
> coming week; after that, all the types will be available without length 
> restriction.
> 
> -Gert
> 
> 
> On 20/01/2012, at 23.38, Serhiy Polyakov wrote:
> 
>> I am using GSearch 2.4. I understand that if I still want to full-text
>> index very large documents, I can switch from Tika back to PDFBox in the
>> configuration (getDatastreamFromTika -> getDatastreamText). I also
>> want to full-text index MS Word, Excel, PowerPoint and other types.
>> Which software component will actually do the extraction from
>> those file types if I switch to PDFBox?
>> 
>> Thanks,
>> Serhiy
>> 
>> 
>> On Mon, Jan 16, 2012 at 11:31 AM, Gert Schmeltz Pedersen
>> <gerts...@gmail.com> wrote:
>>> Re (1) I ingested the ERIC document and found that with 
>>> getDatastreamFromTika I get an exception, because the document has more 
>>> than 100000 characters, while with PDFBox directly (getDatastreamText) the 
>>> document gets indexed. This probably explains why some of your documents, 
>>> the longest ones, are not indexed. I will investigate how we can raise 
>>> that character limit in Tika.
>>> 
>>> Re (2) The ERROR in fedora.log appears because your indexing stylesheet 
>>> tries to index a datastream that is not present in that object. You can 
>>> ignore the message or, preferably, change your indexing stylesheet so 
>>> that it only tries to index datastreams that are known to exist.
>>> 
>>> -Gert
>>> 
>>> On 16/01/2012, at 10.16, Serhiy Polyakov wrote:
>>> 
>>>> I tested Fedora Generic Search 2.4
>>>> 
>>>> (1)
>>>> The focus was on PDF full-text indexing. I found that some PDF documents
>>>> are full-text indexed OK but some are not. Those that are not
>>>> full-text indexed can be converted into text using Adobe Acrobat, so
>>>> they are not images. Their metadata is indexed all right in Fedora.
>>>> 
>>>> An example of a document that was not full-text indexed is from the ERIC 
>>>> database:
>>>> 
>>>> "Digest of Education Statistics, 2009. NCES 2010-013"
>>>> 
>>>> http://www.eric.ed.gov/ERICWebPortal/search/recordDetails.jsp?ERICExtSearch_SearchValue_0=ED509883&searchtype=keyword&ERICExtSearch_SearchType_0=no&_pageLabel=RecordDetails&accno=ED509883&_nfls=false&source=ae
>>>> 
>>>> I looked at the fedoragsearch.daily.log and see that fields like
>>>> <field name="dsmd_OBJ.Content-Type"> are there for the problem PDF
>>>> document. However, the field <field name="ds.OBJ"> is absent.
>>>> 
>>>> For other PDF documents that were full-text indexed without problems,
>>>> the field <field name="ds.OBJ"> was in the fedoragsearch.daily.log.
>>>> 
>>>> Any suggestions on how to fix this would help.
>>>> 
>>>> 
>>>> (2)
>>>> Additionally, for each ingest of any object, multiple records beginning
>>>> with the following are written to fedora.log:
>>>> 
>>>> ERROR 2012-01-16 02:45:43.124 [http-8080-4]
>>>> (FedoraAPIABindingSOAPHTTPImpl) Error getting datastream dissemination
>>>> org.fcrepo.server.errors.DatastreamNotFoundException: [DefaulAccess]
>>>> No datastream could be returned. Either there is no datastream for the
>>>> digital object "mynamesp:someid" with datastream ID of "QUERY "  OR
>>>> there are no datastreams that match the specified date/time value of
>>>> "null "  .
>>>> ...
>>>> ...
>>>> 
>>>> "mynamesp:someid" is my collection where I ingest objects.
>>>> 
>>>> Should I ignore those?
>>>> 
>>>> 
>>>> Thank you,
>>>> Serhiy
>>>> 
>>>> ------------------------------------------------------------------------------
>>>> RSA(R) Conference 2012
>>>> Mar 27 - Feb 2
>>>> Save $400 by Jan. 27
>>>> Register now!
>>>> http://p.sf.net/sfu/rsa-sfdev2dev2
>>>> _______________________________________________
>>>> Fedora-commons-users mailing list
>>>> Fedora-commons-users@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/fedora-commons-users
>>> 
>>> 
>> 
> 