I have not heard about such uses or solutions.

-Gert


On 22/01/2012, at 21.31, Serhiy Polyakov wrote:

> That’s great news about the upcoming improved control option!
> 
> Just curious about past usage and best practice. Do you know whether
> developers were implementing full-text indexing for non-PDF
> datastreams (such as MS Word and other office documents) stored in
> Fedora repositories before GSearch 2.4? If so, did they use
> alternative or custom solutions?
> 
> Thank you,
> Serhiy
> 
> 
> 
> 
> On Sun, Jan 22, 2012 at 5:28 AM, Gert Schmeltz Pedersen
> <gerts...@gmail.com> wrote:
>> PDFBox can extract text from PDF files only. Before Tika was included in
>> GSearch 2.4, GSearch could not extract text from the other types. Improved
>> control of the writeLimit in Tika will be released in GSearch during the
>> coming week; after that, all the types will be available without a length
>> restriction.
>> 
>> -Gert
>> 
>> 
>> On 20/01/2012, at 23.38, Serhiy Polyakov wrote:
>> 
>>> I am using GSearch 2.4. If I still want to full-text index very large
>>> documents, I understand I can switch from Tika back to PDFBox in the
>>> configuration (getDatastreamFromTika -> getDatastreamText). I also
>>> want to full-text index MS Word, Excel, PowerPoint and other types.
>>> Which software component will actually do the extraction from
>>> those file types if I switch to PDFBox?
>>> 
>>> Thanks,
>>> Serhiy
>>> 
>>> 
>>> On Mon, Jan 16, 2012 at 11:31 AM, Gert Schmeltz Pedersen
>>> <gerts...@gmail.com> wrote:
>>>> Re (1) I ingested the ERIC document and found that with
>>>> getDatastreamFromTika I get an exception, because the document has more
>>>> than 100000 characters, while with PDFBox directly (getDatastreamText) the
>>>> document gets indexed. This probably explains why some of your documents,
>>>> the longest ones, are not indexed. I will investigate how we can raise
>>>> that character limit in Tika.
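The failure mode described above (an exception once the extracted text passes a fixed character count, so nothing at all gets indexed for that datastream) can be illustrated with a toy sketch. This is not Tika's or GSearch's code: the class and function names below are hypothetical, and the only values taken from the thread are the 100000-character limit and the fact that exceeding it raises an exception. Tika's Java API exposes a comparable limit as a constructor argument of BodyContentHandler, where -1 disables it; whether GSearch 2.4 wires it up that way is an assumption here.

```python
class WriteLimitReached(Exception):
    """Raised when the extracted text would exceed the configured limit."""


class LimitedTextCollector:
    """Toy stand-in for a text handler with a write limit (hypothetical)."""

    def __init__(self, write_limit=100_000):
        # In this sketch a negative limit means "no limit".
        self.write_limit = write_limit
        self.chunks = []
        self.count = 0

    def write(self, text):
        # Refuse the chunk if accepting it would push us past the limit.
        if 0 <= self.write_limit < self.count + len(text):
            raise WriteLimitReached(
                f"document exceeds {self.write_limit} characters"
            )
        self.chunks.append(text)
        self.count += len(text)

    def text(self):
        return "".join(self.chunks)


def extract(chunks, write_limit=100_000):
    """Return the full text, or None if the limit was hit (nothing indexed)."""
    collector = LimitedTextCollector(write_limit)
    try:
        for chunk in chunks:
            collector.write(chunk)
    except WriteLimitReached:
        return None
    return collector.text()
```

With the default limit a 100001-character document yields no text at all, which matches the symptom of long PDFs silently missing from the full-text index; passing write_limit=-1 in this sketch lets the same document through.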
>>>> 
>>>> Re (2) The ERROR in fedora.log occurs because your indexing stylesheet
>>>> tries to index a datastream that is not present in that object. You can
>>>> ignore the message or, preferably, change your indexing stylesheet so
>>>> that it only tries to index datastreams that are known to exist.
>>>> 
>>>> -Gert
>>>> 
>>>> On 16/01/2012, at 10.16, Serhiy Polyakov wrote:
>>>> 
>>>>> I tested Fedora Generic Search 2.4.
>>>>> 
>>>>> (1)
>>>>> The focus was on PDF full-text indexing. I found that some PDF documents
>>>>> are full-text indexed OK but some are not. Those that are not full-text
>>>>> indexed can be converted into text using Adobe Acrobat, so they are
>>>>> not images. Their metadata is indexed correctly in Fedora.
>>>>> 
>>>>> An example of a document that was not full-text indexed, from the ERIC
>>>>> database:
>>>>> 
>>>>> "Digest of Education Statistics, 2009. NCES 2010-013"
>>>>> 
>>>>> http://www.eric.ed.gov/ERICWebPortal/search/recordDetails.jsp?ERICExtSearch_SearchValue_0=ED509883&searchtype=keyword&ERICExtSearch_SearchType_0=no&_pageLabel=RecordDetails&accno=ED509883&_nfls=false&source=ae
>>>>> 
>>>>> I looked at fedoragsearch.daily.log and see that fields like
>>>>> <field name="dsmd_OBJ.Content-Type"> are there for the problem PDF
>>>>> document. However, the field <field name="ds.OBJ"> is absent.
>>>>> 
>>>>> For other PDF documents that were full-text indexed without problems,
>>>>> the field <field name="ds.OBJ"> was in fedoragsearch.daily.log.
>>>>> 
>>>>> Any suggestions on how to fix this would help.
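The manual check described above (datastream metadata fields present, but no ds.OBJ text field) could be automated roughly as follows. This is only a sketch under assumptions: it supposes that each indexed object appears in the log as a <doc>...</doc> block of <field name="..."> elements and that the object identifier is in a field named "PID". Neither detail is confirmed in this thread, and the function name and sample data are hypothetical.

```python
import re

# Assumed log layout: one <doc>...</doc> block per indexed object.
DOC_RE = re.compile(r"<doc>(.*?)</doc>", re.DOTALL)
FIELD_RE = re.compile(r'<field name="([^"]+)"[^>]*>(.*?)</field>', re.DOTALL)

# Made-up sample in the assumed layout, for illustration only.
SAMPLE = """
<doc><field name="PID">demo:1</field>
<field name="dsmd_OBJ.Content-Type">application/pdf</field>
<field name="ds.OBJ">extracted text</field></doc>
<doc><field name="PID">demo:2</field>
<field name="dsmd_OBJ.Content-Type">application/pdf</field></doc>
"""


def docs_missing_full_text(log_text, ds_id="OBJ"):
    """Return PIDs whose datastream metadata was indexed but whose text was not."""
    missing = []
    for doc in DOC_RE.findall(log_text):
        fields = FIELD_RE.findall(doc)
        names = {name for name, _ in fields}
        has_metadata = any(n.startswith(f"dsmd_{ds_id}.") for n in names)
        has_text = f"ds.{ds_id}" in names
        if has_metadata and not has_text:
            # "PID" as the identifier field name is an assumption.
            pid = next(
                (value.strip() for name, value in fields if name == "PID"),
                "(unknown)",
            )
            missing.append(pid)
    return missing
```

Calling docs_missing_full_text(SAMPLE) returns the identifiers of objects whose text extraction apparently failed, which gives a list of candidates to re-test with getDatastreamText.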
>>>>> 
>>>>> 
>>>>> (2)
>>>>> Additionally, for each ingest of any object, multiple records starting
>>>>> with the following are written to fedora.log:
>>>>> 
>>>>> ERROR 2012-01-16 02:45:43.124 [http-8080-4]
>>>>> (FedoraAPIABindingSOAPHTTPImpl) Error getting datastream dissemination
>>>>> org.fcrepo.server.errors.DatastreamNotFoundException: [DefaulAccess]
>>>>> No datastream could be returned. Either there is no datastream for the
>>>>> digital object "mynamesp:someid" with datastream ID of "QUERY "  OR
>>>>> there are no datastreams that match the specified date/time value of
>>>>> "null "  .
>>>>> ...
>>>>> ...
>>>>> 
>>>>> "mynamesp:someid" is my collection where I ingest objects.
>>>>> 
>>>>> Should I ignore those?
>>>>> 
>>>>> 
>>>>> Thank you,
>>>>> Serhiy
>>>>> 
>>>>> ------------------------------------------------------------------------------
>>>>> RSA(R) Conference 2012
>>>>> Mar 27 - Feb 2
>>>>> Save $400 by Jan. 27
>>>>> Register now!
>>>>> http://p.sf.net/sfu/rsa-sfdev2dev2
>>>>> _______________________________________________
>>>>> Fedora-commons-users mailing list
>>>>> Fedora-commons-users@lists.sourceforge.net
>>>>> https://lists.sourceforge.net/lists/listinfo/fedora-commons-users
>>>> 
>>>> 
>>> 
>> 
>> 
> 


