Gert,

This clarifies my concerns. Please let us know if you find a way to
raise the Tika character limit. Otherwise I will try switching to
PDFBox later.

Thank you!
--Serhiy


On Mon, Jan 16, 2012 at 11:31 AM, Gert Schmeltz Pedersen
<gerts...@gmail.com> wrote:
> Re (1) I ingested the ERIC document and found that getDatastreamFromTika
> throws an exception because the document has more than 100000 characters,
> while PDFBox used directly (getDatastreamText) indexes the document.
> This probably explains why some of your documents, the longest ones, are
> not indexed. I will investigate how we can raise that character limit in
> Tika.
>
> Re (2) The ERROR in fedora.log occurs because your indexing stylesheet tries
> to index a datastream that is not present in that object. You can ignore
> the message or, preferably, change your indexing stylesheet so that it only
> tries to index datastreams that are known to exist.
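The existence check described above can be sketched as follows. This is only an illustration in Java, not part of GSearch or of any indexing stylesheet: the class name DatastreamCheck, the method hasDatastream, and the sample FOXML string are hypothetical, and it assumes the object's FOXML is available as text. In a stylesheet the equivalent would presumably be a test on the foxml:datastream element with the wanted ID before that datastream is requested.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only.
class DatastreamCheck {

    // Namespace URI used by FOXML elements.
    static final String FOXML_NS = "info:fedora/fedora-system:def/foxml#";

    /** Returns true if the FOXML text declares a datastream with the given ID. */
    public static boolean hasDatastream(String foxml, String dsId) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed for the namespace-based lookup below
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(foxml)));

        // Walk every foxml:datastream element and compare its ID attribute.
        NodeList streams = doc.getElementsByTagNameNS(FOXML_NS, "datastream");
        for (int i = 0; i < streams.getLength(); i++) {
            Element ds = (Element) streams.item(i);
            if (dsId.equals(ds.getAttribute("ID"))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample object with DC and OBJ datastreams but no QUERY datastream.
        String foxml =
            "<foxml:digitalObject PID=\"mynamesp:someid\""
          + " xmlns:foxml=\"info:fedora/fedora-system:def/foxml#\">"
          + "<foxml:datastream ID=\"DC\"/>"
          + "<foxml:datastream ID=\"OBJ\"/>"
          + "</foxml:digitalObject>";

        System.out.println(hasDatastream(foxml, "OBJ"));   // prints true
        System.out.println(hasDatastream(foxml, "QUERY")); // prints false
    }
}
```

Guarding the request with a check like this means a datastream that an object does not have is never asked for, so no DatastreamNotFoundException would be logged for it.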
>
> -Gert
>
> On 16/01/2012, at 10.16, Serhiy Polyakov wrote:
>
>> I tested Fedora Generic Search 2.4
>>
>> (1)
>> The focus was on PDF full-text indexing. I found that some PDF documents
>> are full-text indexed OK but some are not. Those that are not indexed
>> can be converted into text using Adobe Acrobat, so they are not just
>> images. Their metadata is indexed fine in Fedora.
>>
>> An example of a document that was not full-text indexed, from the ERIC database:
>>
>> "Digest of Education Statistics, 2009. NCES 2010-013"
>>
>> http://www.eric.ed.gov/ERICWebPortal/search/recordDetails.jsp?ERICExtSearch_SearchValue_0=ED509883&searchtype=keyword&ERICExtSearch_SearchType_0=no&_pageLabel=RecordDetails&accno=ED509883&_nfls=false&source=ae
>>
>> I looked at fedoragsearch.daily.log and see that fields like
>> <field name="dsmd_OBJ.Content-Type"> are there for the problem PDF
>> document. However, the <field name="ds.OBJ"> field is absent.
>>
>> For other PDF documents that were full-text indexed without problems,
>> the <field name="ds.OBJ"> field was in fedoragsearch.daily.log.
>>
>> Any suggestions on how to fix this would help.
>>
>>
>> (2)
>> Additionally, each time any object is ingested, multiple entries
>> beginning with the following are written to fedora.log:
>>
>> ERROR 2012-01-16 02:45:43.124 [http-8080-4]
>> (FedoraAPIABindingSOAPHTTPImpl) Error getting datastream dissemination
>> org.fcrepo.server.errors.DatastreamNotFoundException: [DefaulAccess]
>> No datastream could be returned. Either there is no datastream for the
>> digital object "mynamesp:someid" with datastream ID of "QUERY "  OR
>> there are no datastreams that match the specified date/time value of
>> "null "  .
>> ...
>> ...
>>
>> "mynamesp:someid" is the collection into which I ingest objects.
>>
>> Should I ignore those?
>>
>>
>> Thank you,
>> Serhiy
>>
>> _______________________________________________
>> Fedora-commons-users mailing list
>> Fedora-commons-users@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/fedora-commons-users

