-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

It uses Apache Tika, and to my understanding, extracts whatever Tika can 
extract. But I invite comment from Gert so we can be sure about that.

If you mean the Java source files by which Tika extraction is made available as 
an XSLT extension function, they are here:

https://github.com/fcrepo/gsearch/blob/master/FedoraGenericSearch/src/java/dk/defxws/fedoragsearch/server/GenericOperationsImpl.java

and for text extraction, here:

https://github.com/fcrepo/gsearch/blob/master/FedoraGenericSearch/src/java/dk/defxws/fedoragsearch/server/TransformerToText.java
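
On the Solr error quoted below ("Illegal character (NULL, unicode 0)"): instead of removing the Tika call altogether, one possible workaround would be to strip the characters that XML 1.0 does not allow from the extracted text before it reaches Solr. What follows is only a minimal sketch of that filtering idea, written in Python for brevity. The name strip_invalid_xml_chars is hypothetical and is not part of GSearch or Tika; an equivalent filter would have to be written in Java or in your own indexing stylesheet.

```python
import re

# Everything outside the XML 1.0 "Char" production:
# tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF are allowed.
_INVALID_XML_CHARS = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text: str) -> str:
    """Return text with characters that are illegal in XML 1.0 removed.

    Hypothetical helper for illustration only; not GSearch or Tika code.
    """
    return _INVALID_XML_CHARS.sub("", text)


if __name__ == "__main__":
    # Simulated extractor output containing a NUL and another control character.
    extracted = "JFIF\x00\x01 ok\n"
    print(repr(strip_invalid_xml_chars(extracted)))
```

Filtering like this only hides the symptom; if the extracted text from an image is of no use for searching, leaving it out of the index may still be the simpler choice.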

- ---
A. Soroka
The University of Virginia Library

On Jul 24, 2013, at 8:49 AM, Alistair Young wrote:

> I can see how it's useful but with it in, I have a jpeg file that can't
> be indexed. What sort of technical assertions does it extract/infer? I
> could see if there's something strange in the image file.
> 
> Alternately, what's the source file and I'll have a look...
> 
> Alistair
> 
> -- 
> mov eax,1
> mov ebx,0
> int 80h
> 
> 
> 
> 
> On 24/07/2013 13:42, "aj...@virginia.edu" <aj...@virginia.edu> wrote:
> 
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>> 
>> I was one of the people who instigated Gert to add that functionality. The
>> motivation is to be able to extract technical assertions about binary
>> datastreams and use them in indexing. It's not extracting content from
>> images, although it could extract content from PDF files or other
>> text-containing formats.
>> 
>> On perhaps a more useful note, you should definitely expect to alter the
>> default indexing stylesheets, or even better, to create your own that are
>> suited to your particular purposes.
>> 
>> - ---
>> A. Soroka
>> The University of Virginia Library
>> 
>> On Jul 24, 2013, at 8:32 AM, Alistair Young wrote:
>> 
>>> sorted it by removing the Apache Tika extraction from:
>>> 
>>> WEB-INF/classes/fgsconfigFinal/index/FgsIndex/foxmlToSolrGenerated.xslt
>>> 
>>> it seems it extracts the content and tries to index it. Not sure why it
>>> would want to extract the content of an image but when it does it causes
>>> Solr to fail to index the resource:
>>> 
>>> SEVERE: org.apache.solr.common.SolrException: Illegal character (NULL,
>>> unicode 0) encountered: not valid in any content
>>> 
>>> It seems to treat only some jpg files as if they were not jpg files.
>>> 
>>> Alistair
>>> 
>>> -- 
>>> mov eax,1
>>> mov ebx,0
>>> int 80h
>>> 
>>> From: Alistair Young <alistair.yo...@uhi.ac.uk>
>>> Reply-To: "Support and info exchange list for Fedora users."
>>> <fedora-commons-users@lists.sourceforge.net>
>>> Date: Wednesday, 24 July 2013 11:03
>>> To: "Support and info exchange list for Fedora users."
>>> <fedora-commons-users@lists.sourceforge.net>
>>> Subject: Re: [fcrepo-user] Does gsearch index content with solr?
>>> 
>>> sorry should have mentioned, it's the content datastream, i.e.
>>> image/jpeg
>>> 
>>> Alistair
>>> 
>>> -- 
>>> mov eax,1
>>> mov ebx,0
>>> int 80h
>>> 
>>> From: Alistair Young <alistair.yo...@uhi.ac.uk>
>>> Reply-To: "Support and info exchange list for Fedora users."
>>> <fedora-commons-users@lists.sourceforge.net>
>>> Date: Wednesday, 24 July 2013 10:59
>>> To: "Support and info exchange list for Fedora users."
>>> <fedora-commons-users@lists.sourceforge.net>
>>> Subject: [fcrepo-user] Does gsearch index content with solr?
>>> 
>>> I have a weird problem. I dropped a foxml file into
>>> FgsConfig/indexingXsltGenerator/foxml and configured etc but certain
>>> files, when uploaded, cause solr to crash:
>>> 
>>> SEVERE: org.apache.solr.common.SolrException: Illegal character (NULL,
>>> unicode 0) encountered: not valid in any content
>>> 
>>> If I don't include datastream in the foxml it doesn't cause the crash,
>>> i.e. remove this:
>>> 
>>> <foxml:datastream ID="AUDIT" STATE="A" CONTROL_GROUP="X"
>>> VERSIONABLE="false">
>>> 
>>> Should the foxml used to configure gsearch only contain 'metadata',
>>> i.e. DC, RDF etc and not datastreams?
>>> 
>>> thanks,
>>> 
>>> Alistair
>>> 
>>> 
>>> ------------------------------------------------------------------------------
>>> See everything from the browser to the database with AppDynamics
>>> Get end-to-end visibility with application monitoring from AppDynamics
>>> Isolate bottlenecks and diagnose root cause in seconds.
>>> Start your free trial of AppDynamics Pro today!
>>> http://pubads.g.doubleclick.net/gampad/clk?id=48808831&iu=/4140/ostg.clktrk
>>> _______________________________________________
>>> Fedora-commons-users mailing list
>>> Fedora-commons-users@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/fedora-commons-users
>> 
>> -----BEGIN PGP SIGNATURE-----
>> Version: GnuPG/MacGPG2 v2.0.19 (Darwin)
>> Comment: GPGTools - http://gpgtools.org
>> 
>> iQEcBAEBAgAGBQJR78vHAAoJEATpPYSyaoIk8dsIALihgJB0b4OABcOcOnk2qthk
>> 79JqHouayvOFwTNMHsHZMIPXQ9KlD7h/zrHVYPPOqXV8fvNb3+EeQEal5WJxs4Z3
>> mMevFpEpBlOWUOBAiEqayNNfnxNCGQ3ARCRXNzeiaheM43ouFCluOGkX9p3fjqSV
>> qq6QG862vDFvYF69rMH1NiFIUIA/QP8w/K/QzyI8qoblrzWCX2LmQ8NaH5b0oN1j
>> Nb0NXIQv+XOVJZeHFvbHNEzGMGMEWHKs2QsZ1auirOKaO3ccV74+gVTuvDkmmuXL
>> VjQQoxNBTqbkhSpoDsWPCkHE+fVGuWyFS/ffJQ/0heX1rWOkiOFgJhhGuwJOl2Y=
>> =s4aM
>> -----END PGP SIGNATURE-----
>> 
>> 
> 
> 
> 

-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.19 (Darwin)
Comment: GPGTools - http://gpgtools.org

iQEcBAEBAgAGBQJR79CgAAoJEATpPYSyaoIkplQH/RU81U3ekHrGhHDMf6Qn+R5k
aiRSat1jZAQvCgND12GYmWywn9ap4mouDOiN8b4o+881HUUVClcFpgGueQQ3eK3d
VCyhmWs4inO2rMz8RTNYWDwYfvBAB9qk4Ji6gSj2bM+VnTV6F64LuJRnhToqbVl+
3cLyTZwAFCgb9GHUuo8jPYomCFpSMKvA/Ohc5z5DXvw9HnHVF2AD2pM/3i5wTl84
zJvgtK/SWCD6HvBZwQbUmXTne9O6h8hHMEZOTG5szxDyQhFAmj4cQChXXxG2u+sI
Z5XZ7Ook43A/iVVM/0XoP7bwoM/uaUPpjlg0iAI0Ekk60BV+0InmCRgKVtwBY+Q=
=GcmN
-----END PGP SIGNATURE-----
