For those interested, I have committed a branch on GitHub for this issue 
(FCREPO-1010); see it at https://github.com/fcrepo/gsearch/tree/fcrepo-1010

It uses Tika 0.10, and will be upgraded to Tika 1.0 once that is released 
(expected in November 2011).
The branch adds two functions to GenericOperationsImpl:
- getDatastreamFromTika: retrieves the text only
- getDatastreamFromTikaWithMetadata: retrieves metadata also
The branch comes with a test suite in gsearch.test.fgs24_1010, where the two 
functions are tested on both Lucene and Solr.
The tests cover docx, doc, and pdf datastreams, but in principle every format 
that Tika supports is available, since the branch uses Tika's AutoDetectParser.
Feedback is welcome.

-Gert

PS: All GSearch issues are at 
https://jira.duraspace.org/secure/IssueNavigator.jspa?mode=hide&requestId=10311 
.
GSearch 2.3 is at 
https://wiki.duraspace.org/display/FCSVCS/Generic+Search+Service+2.3 .
The source of GSearch 2.3 is at https://github.com/fcrepo/gsearch .



On 12/10/2011, at 16.24, <aj...@virginia.edu> <aj...@virginia.edu> wrote:

> I've offered one straightforward possibility (one that was discussed briefly 
> in Austin) at:
> 
> https://jira.duraspace.org/browse/FCREPO-1010
> 
> Use Apache Tika for extraction:
> Apache Tika is a toolkit that can extract text and metadata from a wide 
> variety of mimetyped formats (including PDF, via PDFBox). Employing Tika as 
> an extraction engine in GSearch would immediately expand enormously the 
> possible range of material over which GSearch could operate, and going 
> forward, GSearch would benefit from new parsers and better-performing parsers 
> created as part of that effort.
> 
> 
> 
> ---
> A. Soroka
> Online Library Environment
> the University of Virginia Library
> 
> 
> 
> 
> On Oct 12, 2011, at 10:07 AM, Gert Schmeltz Pedersen wrote:
> 
>> This message is meant to open a discussion of the roadmap for GSearch. 
>> It started in a small group, but we invite participation from the wider 
>> group of fedora-developers. I copy this message to the fedora-users list so 
>> that GSearch users are informed about the discussion, but to follow it 
>> onwards and to contribute they have to subscribe to the fedora-developers 
>> list.
>> 
>> I will initiate the discussion with a status. GSearch 2.2 has been the 
>> current release since December 2008. At OR2011 in Austin in June 2011 I 
>> presented a plan for development of GSearch, see 
>> https://conferences.tdl.org/or/OR2011/OR2011main/paper/view/416/127 . 
>> Following that, I have provided GSearch 2.3, and the official release is 
>> near. You can get the source at https://github.com/fcrepo/gsearch and 
>> fedoragsearch.war from the DTU prerelease site at 
>> http://www.cvt.dk/fedoragsearch/ and see the documentation page at 
>> http://miranth.cvt.dk/fedoragsearch/ .
>> 
>> Next step in the plan is to provide GSearch 2.4 by the end of the year. I 
>> will use the issue tracker at 
>> https://jira.duraspace.org/secure/IssueNavigator.jspa?mode=hide&requestId=10311
>>  to track the work, and I invite your feedback and contributions. Potential 
>> committers may be enrolled; I have already had some responses to my 
>> invitation to potential committers at OR2011. Some of you may have heard at 
>> OR2011 that I will retire by the end of the year. However, I will continue 
>> part-time to support GSearch users on the fedora-users list and continue to 
>> develop for GSearch and Fedora in partnership with people who have an 
>> interest in that.
>> 
>> The post-2.4 roadmap discussion can take place both on this list and through 
>> new or modified issues in the issue tracker. I think that members of the 
>> initial small group will soon bring up issues.
>> 
>> Gert
>> ------------------------------------------------------------------------------
>> All the data continuously generated in your IT infrastructure contains a
>> definitive record of customers, application performance, security
>> threats, fraudulent activity and more. Splunk takes this data and makes
>> sense of it. Business sense. IT sense. Common sense.
>> http://p.sf.net/sfu/splunk-d2d-oct
>> _______________________________________________
>> Fedora-commons-developers mailing list
>> fedora-commons-develop...@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/fedora-commons-developers
> 
> 


