Gert,

These answers clarify my concerns. Please let us know if you find a way to raise the Tika character limit. Otherwise I will try switching to PDFBox later.

Thank you!
--Serhiy

On Mon, Jan 16, 2012 at 11:31 AM, Gert Schmeltz Pedersen <gerts...@gmail.com> wrote:
> Re (1) I ingested the ERIC document and found that with getDatastreamFromTika
> I get an exception, because the document has more than 100000 characters,
> while with PDFBox directly (getDatastreamText) the document gets indexed.
> This probably explains why some of your documents, the longest ones, are
> not indexed. I will investigate how we can raise that character limit in
> Tika.
>
> Re (2) The ERROR in fedora.log occurs because your indexing stylesheet tries
> to index a datastream that is not present in that object. You can ignore
> the message or, preferably, change your indexing stylesheet so that it only
> tries to index datastreams that are known to exist.
>
> -Gert
>
> On 16/01/2012, at 10.16, Serhiy Polyakov wrote:
>
>> I tested Fedora Generic Search 2.4
>>
>> (1)
>> The focus was on PDF full-text indexing. I found that some PDF documents
>> are full-text indexed OK but some are not. Those that are not full-text
>> indexed can be converted into text using Adobe Acrobat, so they are
>> not images. Their metadata is indexed all right in Fedora.
>>
>> An example of a document that was not full-text indexed is from the ERIC database:
>>
>> "Digest of Education Statistics, 2009. NCES 2010-013"
>>
>> http://www.eric.ed.gov/ERICWebPortal/search/recordDetails.jsp?ERICExtSearch_SearchValue_0=ED509883&searchtype=keyword&ERICExtSearch_SearchType_0=no&_pageLabel=RecordDetails&accno=ED509883&_nfls=false&source=ae
>>
>> I looked at the fedoragsearch.daily.log and see that fields like
>> <field name="dsmd_OBJ.Content-Type"> are there for the problem PDF
>> document. However, the field <field name="ds.OBJ"> is absent.
>>
>> For other PDF documents that were full-text indexed without problems,
>> the field <field name="ds.OBJ"> was in the fedoragsearch.daily.log
>>
>> Any suggestion on how to fix this would help.
>>
>>
>> (2)
>> Additionally, for each ingest of any object, multiple records starting
>> with the following records are written in the fedora.log:
>>
>> ERROR 2012-01-16 02:45:43.124 [http-8080-4]
>> (FedoraAPIABindingSOAPHTTPImpl) Error getting datastream dissemination
>> org.fcrepo.server.errors.DatastreamNotFoundException: [DefaulAccess]
>> No datastream could be returned. Either there is no datastream for the
>> digital object "mynamesp:someid" with datastream ID of "QUERY " OR
>> there are no datastreams that match the specified date/time value of
>> "null " .
>> ...
>> ...
>>
>> "mynamesp:someid" is my collection where I ingest objects.
>>
>> Should I ignore those?
>>
>>
>> Thank you,
>> Serhiy
>>
>> _______________________________________________
>> Fedora-commons-users mailing list
>> Fedora-commons-users@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/fedora-commons-users
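
Gert's point (1) describes an exception once the extracted text exceeds 100000 characters, while unbounded extraction indexes the same document. The sketch below illustrates that mechanism with a bounded SAX text handler built only on the JDK. It is not Tika's or GSearch's code: the class names `WriteLimitedTextHandler` and `WriteLimitDemo` and the helper `extract` are hypothetical, and the 250000-character input is made up for illustration. For comparison, Apache Tika's `BodyContentHandler` has a constructor that takes an int write limit and accepts -1 to disable it; whether GSearch 2.4 exposes that setting is not confirmed in this thread.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in for a bounded text handler; not Tika's or GSearch's code. */
class WriteLimitedTextHandler extends DefaultHandler {
    private final int writeLimit; // maximum characters to keep, or -1 for no limit
    private final StringBuilder text = new StringBuilder();

    WriteLimitedTextHandler(int writeLimit) {
        this.writeLimit = writeLimit;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (writeLimit >= 0 && text.length() + length > writeLimit) {
            // Keep the part that still fits, then signal that the limit was hit.
            text.append(ch, start, writeLimit - text.length());
            throw new SAXException("write limit of " + writeLimit + " characters reached");
        }
        text.append(ch, start, length);
    }

    String text() {
        return text.toString();
    }
}

class WriteLimitDemo {
    /** Feeds `total` characters in chunks; returns true if extraction completed. */
    static boolean extract(WriteLimitedTextHandler handler, int total) {
        char[] chunk = new char[4096];
        java.util.Arrays.fill(chunk, 'x');
        try {
            for (int sent = 0; sent < total; sent += chunk.length) {
                handler.characters(chunk, 0, Math.min(chunk.length, total - sent));
            }
            return true;
        } catch (SAXException e) {
            // A long document trips the limit, so nothing would be indexed.
            return false;
        }
    }

    public static void main(String[] args) {
        // Bounded at 100000 characters: a 250000-character document fails.
        System.out.println(extract(new WriteLimitedTextHandler(100000), 250000));
        // Limit disabled with -1: the same document is extracted in full.
        System.out.println(extract(new WriteLimitedTextHandler(-1), 250000));
    }
}
```

Under these assumptions the first call reports failure and the second succeeds, which matches the pattern in the thread: only the longest documents lose their `ds.OBJ` field, while their metadata fields are still indexed.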