On 13/03/2015 15:42, Austin Harmon wrote:
Thank you for the information. I've been trying to use the mapper
attachments plugin, which has Apache Tika built into it. I am just
surprised and confused that so many companies use Elasticsearch and yet
it is so difficult to index the contents of a document. If I need to
index the contents of documents, would it be easier and more efficient
to switch over to Apache Solr? As I said, I have 2 TB of data, so it
isn't practical for me to manually input each document so it can be
indexed with specific JSON. If you have any experience with Solr, please
let me know if it would be a good solution to my problem.
Hi Austin,
Solr's SolrCell lets you submit documents in various formats directly to
Solr, which then uses Tika to extract the plain text for indexing.
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
However, we don't like this approach, as Tika itself can fall over (when
faced with a great big complex PDF, for example; I've seen ones that run
to 3000 pages) or just eat up all the resources on your Solr server. So
we tend to run Tika as part of an external indexing process, written in
Python or Java, that then throws the plain text at Solr. We can then
manage that process separately, restart it when it fails, and so on.
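
A minimal sketch of what such an external process might look like in
Python. It assumes the Tika command-line app has been downloaded as a
local jar (the path tika-app.jar is a placeholder) and relies on its
--text option; the function names and the timeout value are invented
for illustration, and sending the text on to Solr is left out.

```python
import subprocess
import sys

# Assumption: path to a locally downloaded Tika command-line jar (placeholder).
TIKA_JAR = "tika-app.jar"


def tika_command(path):
    """Build the command that asks the Tika app for plain text on stdout."""
    return ["java", "-jar", TIKA_JAR, "--text", path]


def extract_text(command, timeout_seconds=60):
    """Run an extractor in a child process; return its text, or None on failure.

    The child is killed if it exceeds the timeout, so one huge or broken
    document cannot stall or crash the indexing process itself.
    """
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
        )
    except (subprocess.TimeoutExpired, OSError):
        # Timed out, or the executable could not be started at all.
        return None
    if result.returncode != 0:
        return None
    # Undecodable bytes are replaced rather than aborting the whole run.
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Usage: python extract.py file1.pdf file2.docx ...
    for path in sys.argv[1:]:
        text = extract_text(tika_command(path))
        status = "failed" if text is None else "%d characters" % len(text)
        print(path, status)
```

Because extraction happens in a separate process, a failure only costs
that one document: the caller gets None back and can log the file and
move on.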
There are many other ways to do this as well of course - here's some
code that we wrote many moons ago which might be helpful:
https://code.google.com/p/flaxcode/source/browse/trunk/flax_filters/README
Cheers
Charlie
thanks,
Austin
On Thursday, March 12, 2015 at 4:04:29 PM UTC-5, Aaron Mefford wrote:
Take a look at Apache Tika (http://tika.apache.org/). It will allow
you to extract the contents of the documents for indexing; that
extraction is outside the scope of Elasticsearch indexing itself. A
good tool to make these files downloadable is also out of scope, but
I'll answer what is in scope. You need to put the files somewhere
they can be accessed by a URL. Any web server is capable of this; of
course your needs may vary, but this isn't the list for those
questions. Once you have a URL that the document can be accessed by,
include that in your indexing of the document so that you can point
to that URL in your search results.
I am sure there are other options out there for extracting the
contents of Word documents; Apache Tika is one that is frequently
used for this purpose, though.
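
To make the URL suggestion concrete, here is a rough sketch of a
document that carries both the extracted text and a download link. The
field names (file_name, file_path, url, body), the directory
/srv/documents and the host files.example.com are all invented for
illustration; nothing here is prescribed by Elasticsearch.

```python
import json
import posixpath
from urllib.parse import quote

# Assumptions: files live under DOC_ROOT and a web server exposes that
# directory at BASE_URL. Both values are placeholders.
DOC_ROOT = "/srv/documents"
BASE_URL = "http://files.example.com/documents/"


def download_url(path):
    """Map a file path under DOC_ROOT to the URL it is served from."""
    relative = posixpath.relpath(path, DOC_ROOT)
    # quote() escapes spaces and similar characters but keeps "/" as is.
    return BASE_URL + quote(relative)


def build_document(path, extracted_text):
    """Return the JSON-serialisable document to send to the search engine."""
    return {
        "file_name": posixpath.basename(path),
        "file_path": path,
        "url": download_url(path),
        "body": extracted_text,  # plain text produced by Tika or similar
    }


doc = build_document("/srv/documents/reports/Q1 summary.docx", "Revenue rose ...")
print(json.dumps(doc, indent=2))
```

A search result can then show the url field as a link, while queries
run against the body field.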
On Thu, Mar 12, 2015 at 2:56 PM, Austin Harmon <[email protected]> wrote:
Okay, so I have a large amount of data (2 TB), and it's all
Microsoft Office documents, PDFs and emails. What is the best way
to go about indexing the body of these documents, so that the
contents of each document are searchable? I tried to use the PHP
client but that isn't helping. I know there are ways to convert
files in PHP, but is there nothing available that takes in these
types of documents? I tried the file_get_contents function in
PHP, but it only takes in text documents. Also, would you know of
a good tool or method to make the files that are searched
downloadable?
Thanks,
Austin
On Thursday, March 12, 2015 at 12:26:13 PM UTC-5,
[email protected] wrote:
Yes, you need to include all the text you want indexed and
searchable as part of the JSON.
How else would you expect ElasticSearch to receive the data?
Regarding large scale production environments, this is why
ElasticSearch scales out.
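
Including the text in the JSON does not mean writing the JSON by
hand; a script can generate it. A sketch, assuming the bulk API's
newline-delimited format (an action line followed by the document
source line). The index name and the sample documents are
placeholders, and older Elasticsearch versions also expect a _type in
the action line.

```python
import json


def bulk_body(index_name, documents):
    """Serialise (doc_id, document) pairs into a bulk-request body."""
    lines = []
    for doc_id, document in documents:
        action = {"index": {"_index": index_name, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(document))
    # The bulk format requires every line, including the last, to end in "\n".
    return "".join(line + "\n" for line in lines)


# Placeholder documents; in practice "body" holds the extracted text.
docs = [
    ("1", {"file_name": "a.pdf", "body": "text extracted from a.pdf"}),
    ("2", {"file_name": "b.docx", "body": "text extracted from b.docx"}),
]
payload = bulk_body("documents", docs)
print(payload, end="")
```

The resulting payload would be POSTed to the cluster's _bulk endpoint
in batches, rather than sending one request per document.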
Aaron
On Wednesday, March 11, 2015 at 12:50:25 PM UTC-6, Austin
Harmon wrote:
Hello,
I'm trying to get an understanding of how to have
full-text search on a document and have the body of the
document be considered during search. I understand how
to do the mapping and use analyzers, but what I don't
understand is how they get the body of the document. If
your fields are file name, file size, file path and file
type, how do the analyzers get the body of the document?
Surely you wouldn't have to put the body of every
document into the JSON? That is how it's done in all the
examples I've seen, but that doesn't make sense for
large-scale production environments. If someone could
please give me some insight into how this process works,
it would be greatly appreciated.
Thank you,
Austin Harmon
--
You received this message because you are subscribed to a topic
in the Google Groups "elasticsearch" group.
To unsubscribe from this topic, visit
https://groups.google.com/d/topic/elasticsearch/mG2k23vbzXQ/unsubscribe
<https://groups.google.com/d/topic/elasticsearch/mG2k23vbzXQ/unsubscribe>.
To unsubscribe from this group and all its topics, send an email
to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/41516b36-18e3-4ef8-8d8d-1e9da6b727a4%40googlegroups.com
<https://groups.google.com/d/msgid/elasticsearch/41516b36-18e3-4ef8-8d8d-1e9da6b727a4%40googlegroups.com?utm_medium=email&utm_source=footer>.
For more options, visit https://groups.google.com/d/optout
<https://groups.google.com/d/optout>.
--
Charlie Hull
Flax - Open Source Enterprise Search
tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828
web: www.flax.co.uk