That’s great news about the upcoming improved control option! I'm just curious about past practice and best practice. Do you know whether developers were implementing full-text indexing for non-PDF datastreams (MS Word and other office documents) stored in Fedora repositories before GSearch 2.4? If so, did they presumably use alternative or custom solutions?

Thank you,
Serhiy

On Sun, Jan 22, 2012 at 5:28 AM, Gert Schmeltz Pedersen <gerts...@gmail.com> wrote:
> PDFBox can extract text from PDF files only. Before the inclusion of Tika in
> GSearch 2.4, GSearch could not extract text from the other types. Improved
> control of the writeLimit in Tika will be released in GSearch during the
> coming week, then all the types are available without length restriction.
>
> -Gert
>
>
> On 20/01/2012, at 23.38, Serhiy Polyakov wrote:
>
>> I am using GSearch 2.4. If I still want to full-text index very large
>> documents I understand I can switch from Tika back to PDFBox in the
>> configuration (getDatastreamFromTika -> getDatastreamText). I also
>> want to full-text index MSWord, Excel, PowerPoint and other types.
>> Which software component will actually be doing the extraction from
>> those file types if I switch to PDFBox?
>>
>> Thanks,
>> Serhiy
>>
>>
>> On Mon, Jan 16, 2012 at 11:31 AM, Gert Schmeltz Pedersen
>> <gerts...@gmail.com> wrote:
>>> Re (1) I ingested the ERIC document and find that with
>>> getDatastreamFromTika I get an exception, because the document has more
>>> than 100000 characters, while with PDFBox directly (getDatastreamText) the
>>> document gets indexed. This probably explains why some of your documents
>>> are not indexed, the longest ones. I will investigate how we can raise
>>> that character limit in Tika.
>>>
>>> Re (2) The ERROR in fedora.log comes because your indexing stylesheet tries
>>> to index a datastream which is not present in that object. You can ignore
>>> the message, or preferably, change your indexing stylesheet so that it
>>> only tries to index datastreams that are known to exist.
>>>
>>> -Gert
>>>
>>> On 16/01/2012, at 10.16, Serhiy Polyakov wrote:
>>>
>>>> I tested Fedora Generic Search 2.4
>>>>
>>>> (1)
>>>> Focus was on PDF full text indexing. I found that some PDF documents
>>>> are full text indexed OK but some are not. Those that are not indexed
>>>> full text can be converted into text using Adobe Acrobat so they are
>>>> not images. Their metadata is indexed alright in Fedora.
>>>>
>>>> An example of a document that was not full text indexed is from the ERIC
>>>> database:
>>>>
>>>> "Digest of Education Statistics, 2009. NCES 2010-013"
>>>>
>>>> http://www.eric.ed.gov/ERICWebPortal/search/recordDetails.jsp?ERICExtSearch_SearchValue_0=ED509883&searchtype=keyword&ERICExtSearch_SearchType_0=no&_pageLabel=RecordDetails&accno=ED509883&_nfls=false&source=ae
>>>>
>>>> I looked at the fedoragsearch.daily.log and see that fields like
>>>> <field name="dsmd_OBJ.Content-Type"> are there for the problem PDF
>>>> document. However, the field <field name="ds.OBJ"> is absent.
>>>>
>>>> For other PDF documents that were full text indexed without problems,
>>>> the field <field name="ds.OBJ"> was in the fedoragsearch.daily.log
>>>>
>>>> Any suggestion on how to fix this would help.
>>>>
>>>>
>>>> (2)
>>>> Additionally, for each ingest of any object multiple records starting
>>>> with the following records are written in the fedora.log:
>>>>
>>>> ERROR 2012-01-16 02:45:43.124 [http-8080-4]
>>>> (FedoraAPIABindingSOAPHTTPImpl) Error getting datastream dissemination
>>>> org.fcrepo.server.errors.DatastreamNotFoundException: [DefaulAccess]
>>>> No datastream could be returned. Either there is no datastream for the
>>>> digital object "mynamesp:someid" with datastream ID of "QUERY " OR
>>>> there are no datastreams that match the specified date/time value of
>>>> "null " .
>>>> ...
>>>> ...
>>>>
>>>> "mynamesp:someid" is my collection where I ingest objects.
>>>>
>>>> Should I ignore those?
>>>>
>>>>
>>>> Thank you,
>>>> Serhiy

_______________________________________________
Fedora-commons-users mailing list
Fedora-commons-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/fedora-commons-users