Re (1): I ingested the ERIC document and found that getDatastreamFromTika 
throws an exception because the document has more than 100000 characters, 
whereas PDFBox called directly (getDatastreamText) indexes the document. This 
probably explains why some of your documents, the longest ones, are not 
indexed. I will investigate how we can raise that character limit in Tika.
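
To make the failure mode concrete, here is a minimal Java sketch of a text 
collector with a write limit. It is only a stand-in that mimics the behaviour 
I observed, not Tika or GSearch code; the class name LimitedTextCollector and 
its methods are hypothetical. In Tika itself, BodyContentHandler has a 
constructor that takes a write limit, where -1 disables the limit; whether 
getDatastreamFromTika can pass that through is something I have not verified.

```java
// Hypothetical stand-in for a text handler with a character limit.
final class LimitedTextCollector {
    private final int writeLimit;                 // -1 means "no limit"
    private final StringBuilder text = new StringBuilder();

    LimitedTextCollector(int writeLimit) {
        this.writeLimit = writeLimit;
    }

    // Appends extracted text, failing once the limit would be exceeded.
    void append(String chunk) {
        if (writeLimit >= 0 && text.length() + chunk.length() > writeLimit) {
            throw new IllegalStateException(
                "document has more than " + writeLimit + " characters");
        }
        text.append(chunk);
    }

    int length() {
        return text.length();
    }
}
```

With a limit of 100000 a long document fails part-way through extraction, 
while a collector created with -1 accepts the whole text.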

Re (2): The ERROR in fedora.log occurs because your indexing stylesheet tries 
to index a datastream that is not present in that object. You can ignore the 
message or, preferably, change your indexing stylesheet so that it only tries 
to index datastreams that are known to exist.
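
The stylesheet itself is XSLT, so the following Java sketch only illustrates 
the guard logic: look up which datastream IDs the object's FOXML actually 
contains, and request a datastream only if its ID is among them. The class 
and method names are hypothetical, and I am assuming the FOXML 1.1 namespace 
and the ID attribute on foxml:datastream.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: decides whether a datastream is worth requesting.
final class DatastreamGuard {
    // Assumed FOXML namespace URI.
    static final String FOXML_NS = "info:fedora/fedora-system:def/foxml#";

    // Collects the ID attribute of every foxml:datastream element.
    static Set<String> datastreamIds(String foxml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(
                new ByteArrayInputStream(foxml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagNameNS(FOXML_NS, "datastream");
            Set<String> ids = new LinkedHashSet<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                ids.add(((Element) nodes.item(i)).getAttribute("ID"));
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse FOXML", e);
        }
    }

    // True only if the object really has a datastream with this ID.
    static boolean shouldIndex(String foxml, String datastreamId) {
        return datastreamIds(foxml).contains(datastreamId);
    }
}
```

For an object that has DC and OBJ but no QUERY datastream, 
shouldIndex(foxml, "QUERY") returns false, so no dissemination request is 
made and nothing is written to fedora.log for it.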

-Gert

On 16/01/2012, at 10.16, Serhiy Polyakov wrote:

> I tested Fedora Generic Search 2.4
> 
> (1)
> Focus was on PDF full text indexing. I found that some PDF documents
> are full text indexed OK but some are not. Those that are not indexed
> full text can be converted into text using Adobe Acrobat so they are
> not images. Their metadata is indexed alright in Fedora.
> 
> Example of the document that was not full text indexed is from ERIC database:
> 
> "Digest of Education Statistics, 2009. NCES 2010-013"
> 
> http://www.eric.ed.gov/ERICWebPortal/search/recordDetails.jsp?ERICExtSearch_SearchValue_0=ED509883&searchtype=keyword&ERICExtSearch_SearchType_0=no&_pageLabel=RecordDetails&accno=ED509883&_nfls=false&source=ae
> 
> I looked at the fedoragsearch.daily.log and see that fields like
> <field name="dsmd_OBJ.Content-Type"> are there for the problem PDF
> document. However, a field like <field name="ds.OBJ"> is absent.
> 
> For other PDF documents that were full text indexed without problems,
> the field <field name="ds.OBJ"> was in the fedoragsearch.daily.log.
> 
> Any suggestion how to fix would help.
> 
> 
> (2)
> Additionally, for each ingest of any object multiple records starting
> with the following records are written in the fedora.log:
> 
> ERROR 2012-01-16 02:45:43.124 [http-8080-4]
> (FedoraAPIABindingSOAPHTTPImpl) Error getting datastream dissemination
> org.fcrepo.server.errors.DatastreamNotFoundException: [DefaulAccess]
> No datastream could be returned. Either there is no datastream for the
> digital object "mynamesp:someid" with datastream ID of "QUERY "  OR
> there are no datastreams that match the specified date/time value of
> "null "  .
> ...
> ...
> 
> "mynamesp:someid" is my collection where I ingest objects.
> 
> Should I ignore those?
> 
> 
> Thank you,
> Serhiy
> 
> ------------------------------------------------------------------------------
> RSA(R) Conference 2012
> Mar 27 - Feb 2
> Save $400 by Jan. 27
> Register now!
> http://p.sf.net/sfu/rsa-sfdev2dev2
> _______________________________________________
> Fedora-commons-users mailing list
> Fedora-commons-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/fedora-commons-users

