I have not heard about such uses or solutions.

-Gert

On 22/01/2012, at 21.31, Serhiy Polyakov wrote:

> That’s great about upcoming improved control option!
>
> Just curious about retrospect and best practice. Do you know if
> developers were at all utilizing/implementing full text indexing for
> the non-PDF datastreams (like MS Word, and other office documents)
> stored in the Fedora repositories before GSearch 2.4? If so probably
> they used some alternative or custom solutions?
>
> Thank you,
> Serhiy
>
> On Sun, Jan 22, 2012 at 5:28 AM, Gert Schmeltz Pedersen
> <gerts...@gmail.com> wrote:
>> PDFBox can extract text from PDF files only. Before the inclusion of Tika in
>> GSearch 2.4, GSearch could not extract text from the other types. Improved
>> control of the writeLimit in Tika will be released in GSearch during the
>> coming week, then all the types are available without length restriction.
>>
>> -Gert
>>
>> On 20/01/2012, at 23.38, Serhiy Polyakov wrote:
>>
>>> I am using GSearch 2.4. If I still want to full-text index very large
>>> documents I understand I can switch from Tika back to PDFBox in the
>>> configuration (getDatastreamFromTika -> getDatastreamText). I also
>>> want to full-text index MSWord, Excel, PowerPoint and other types.
>>> Which component of software will be actually doing extraction from
>>> those file types if I switch to PDFBox?
>>>
>>> Thanks,
>>> Serhiy
>>>
>>> On Mon, Jan 16, 2012 at 11:31 AM, Gert Schmeltz Pedersen
>>> <gerts...@gmail.com> wrote:
>>>> Re (1) I ingested the ERIC document and find that with
>>>> getDatastreamFromTika I get an exception, because the document has more
>>>> than 100000 characters, while with PDFBox directly (getDatastreamText) the
>>>> document gets indexed. This probably explains, why some of your documents
>>>> are not indexed, the longest ones. I will investigate, how we can raise
>>>> that character limit in Tika.
>>>>
>>>> Re (2) The ERROR in fedora.log comes because your indexing stylesheet
>>>> tries to index a datastream, which is not present in that object. You can
>>>> ignore the message, or preferably, change your indexing stylesheet, so
>>>> that it only tries to index datastreams that are known to exist.
>>>>
>>>> -Gert
>>>>
>>>> On 16/01/2012, at 10.16, Serhiy Polyakov wrote:
>>>>
>>>>> I tested Fedora Generic Search 2.4
>>>>>
>>>>> (1)
>>>>> Focus was on PDF full text indexing. I found that some PDF documents
>>>>> are full text indexed OK but some are not. Those that are not indexed
>>>>> full text can be converted into text using Adobe Acrobat so they are
>>>>> not images. Their metadata is indexed alright in Fedora.
>>>>>
>>>>> Example of the document that was not full text indexed is from ERIC
>>>>> database:
>>>>>
>>>>> "Digest of Education Statistics, 2009. NCES 2010-013"
>>>>>
>>>>> http://www.eric.ed.gov/ERICWebPortal/search/recordDetails.jsp?ERICExtSearch_SearchValue_0=ED509883&searchtype=keyword&ERICExtSearch_SearchType_0=no&_pageLabel=RecordDetails&accno=ED509883&_nfls=false&source=ae
>>>>>
>>>>> I looked at the fedoragsearch.daily.log and see that fields like
>>>>> <field name="dsmd_OBJ.Content-Type"> are there for the problem PDF
>>>>> document. However, a field like <field name="ds.OBJ"> is absent.
>>>>>
>>>>> For other PDF documents that were full text indexed without problems
>>>>> field <field name="ds.OBJ"> was in the fedoragsearch.daily.log
>>>>>
>>>>> Any suggestion how to fix would help.
>>>>>
>>>>>
>>>>> (2)
>>>>> Additionally, for each ingest of any object multiple records starting
>>>>> with the following records are written in the fedora.log:
>>>>>
>>>>> ERROR 2012-01-16 02:45:43.124 [http-8080-4]
>>>>> (FedoraAPIABindingSOAPHTTPImpl) Error getting datastream dissemination
>>>>> org.fcrepo.server.errors.DatastreamNotFoundException: [DefaulAccess]
>>>>> No datastream could be returned.
>>>>> Either there is no datastream for the
>>>>> digital object "mynamesp:someid" with datastream ID of "QUERY " OR
>>>>> there are no datastreams that match the specified date/time value of
>>>>> "null " .
>>>>> ...
>>>>> ...
>>>>>
>>>>> "mynamesp:someid" is my collection where I ingest objects.
>>>>>
>>>>> Should I ignore those?
>>>>>
>>>>>
>>>>> Thank you,
>>>>> Serhiy
>>>>>
>>>>> ------------------------------------------------------------------------------
>>>>> RSA(R) Conference 2012
>>>>> Mar 27 - Feb 2
>>>>> Save $400 by Jan. 27
>>>>> Register now!
>>>>> http://p.sf.net/sfu/rsa-sfdev2dev2
>>>>> _______________________________________________
>>>>> Fedora-commons-users mailing list
>>>>> Fedora-commons-users@lists.sourceforge.net
>>>>> https://lists.sourceforge.net/lists/listinfo/fedora-commons-users
>>>
>>> ------------------------------------------------------------------------------
>>> Try before you buy = See our experts in action!
>>> The most comprehensive online learning library for Microsoft developers
>>> is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
>>> Metro Style Apps, more. Free future releases when you subscribe now!
>>> http://p.sf.net/sfu/learndevnow-dev2
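
The symptom reported in the first message of this thread (fields such as <field name="dsmd_OBJ.Content-Type"> present in fedoragsearch.daily.log, but <field name="ds.OBJ"> absent) can be checked mechanically. What follows is only a minimal sketch, not part of GSearch: it assumes you have copied the logged index document for one object into a string, and that field names follow the dsmd_<ID>.<property> and ds.<ID> patterns quoted above. The function name and the sample document are hypothetical.

```python
import re
from typing import List

# Matches the name attribute of each <field ...> element in the logged document.
FIELD_NAME = re.compile(r'<field\b[^>]*\bname="([^"]+)"')


def datastreams_missing_text(index_doc: str) -> List[str]:
    """Return datastream IDs that have dsmd_<ID>.* fields but no ds.<ID> field.

    Assumption: the naming patterns are the ones quoted in the thread;
    other indexing stylesheets may name their fields differently.
    """
    names = FIELD_NAME.findall(index_doc)
    # IDs for which datastream metadata was indexed, e.g. dsmd_OBJ.Content-Type -> OBJ
    with_metadata = {
        n[len("dsmd_"):].split(".", 1)[0] for n in names if n.startswith("dsmd_")
    }
    # IDs for which extracted full text was indexed, e.g. ds.OBJ -> OBJ
    with_text = {n[len("ds."):] for n in names if n.startswith("ds.")}
    return sorted(with_metadata - with_text)


# Hypothetical sample: metadata for OBJ is present, extracted text is not.
sample = (
    '<doc>'
    '<field name="PID">mynamesp:someid</field>'
    '<field name="dsmd_OBJ.Content-Type">application/pdf</field>'
    '</doc>'
)
print(datastreams_missing_text(sample))
```

An object that shows up in this list is a candidate for the case Gert describes: text extraction failed (for example because the document exceeds the 100000 character limit with getDatastreamFromTika) while the metadata was still indexed.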