I've offered one straightforward possibility (one that was discussed briefly in Austin) at:
https://jira.duraspace.org/browse/FCREPO-1010

Use Apache Tika for extraction: Apache Tika is a toolkit that can extract text and metadata from a wide variety of mimetyped formats (including PDF, via PDFBox). Employing Tika as an extraction engine in GSearch would immediately expand enormously the possible range of material over which GSearch could operate, and going forward, GSearch would benefit from new parsers and better-performing parsers created as part of that effort.

---
A. Soroka
Online Library Environment
the University of Virginia Library

On Oct 12, 2011, at 10:07 AM, Gert Schmeltz Pedersen wrote:

> This message is meant to open for a discussion of the roadmap for GSearch. It
> started in a small group, but we invite participation from the wider group of
> fedora-developers. I copy this message to the fedora-users list so that
> GSearch users are informed about the discussion, but to follow it onwards and
> to contribute they have to subscribe to the fedora-developers list.
>
> I will initiate the discussion with a status. GSearch 2.2 has been the
> current release since December 2008. At OR2011 in Austin in June 2011 I
> presented a plan for development of GSearch, see
> https://conferences.tdl.org/or/OR2011/OR2011main/paper/view/416/127 .
> Following that, I have provided GSearch 2.3, and the official release is
> near. You can get the source at https://github.com/fcrepo/gsearch and
> fedoragsearch.war from the DTU prerelease site at
> http://www.cvt.dk/fedoragsearch/ and see the documentation page at
> http://miranth.cvt.dk/fedoragsearch/ .
>
> Next step in the plan is to provide GSearch 2.4 by the end of the year. I
> will use the issue tracker at
> https://jira.duraspace.org/secure/IssueNavigator.jspa?mode=hide&requestId=10311
> to track the work, and I invite your feedback and contributions. Potential
> committers may be enrolled, I already had some responses to my invitation to
> potential committers at OR2011.
> Some of you may have heard at OR2011, that I
> will retire by the end of the year. However, I will continue part-time to
> support GSearch users on the fedora-users list and continue to develop for
> GSearch and Fedora in partnerships with people, who have an interest in that.
>
> The post-2.4 roadmap discussion can both be on this list and as new or
> modified issues at the issue tracker. I think that members of the initial
> small group will soon bring up issues.
>
> Gert
>
> _______________________________________________
> Fedora-commons-developers mailing list
> fedora-commons-develop...@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/fedora-commons-developers

_______________________________________________
Fedora-commons-users mailing list
Fedora-commons-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/fedora-commons-users