On 13/03/2015 15:42, Austin Harmon wrote:
Thank you for the information. I've been trying to use the mapper
attachment plugin, which has Apache Tika built into it. I am just
surprised and confused that so many companies use Elasticsearch, yet
it is so difficult to index the contents of a document. If I need to
index the contents of documents, would it be easier and more
efficient to switch over to Apache Solr? As I said, I have 2 TB of
data, so it isn't efficient for me to manually prepare specific JSON
for each document so that it can be indexed. If you have any
experience with Solr, please let me know whether it would be a good
solution to my problem.

Hi Austin,

Solr's SolrCell lets you submit documents in various formats directly to Solr, which then uses Tika to extract the plain text for indexing.
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
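
For what it's worth, here's a minimal sketch of pushing a single file at Solr Cell from Python. It assumes the requests library and a core called "documents" (both are just placeholders), and leaves everything else at the defaults:

    import requests

    SOLR_EXTRACT = "http://localhost:8983/solr/documents/update/extract"

    with open("report.pdf", "rb") as f:
        r = requests.post(
            SOLR_EXTRACT,
            params={
                "literal.id": "report.pdf",  # stored as the document's id field
                "commit": "true",            # commit straight away; fine for a quick test
            },
            files={"file": ("report.pdf", f, "application/pdf")},
        )
    r.raise_for_status()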

However, we don't like this approach, as Tika itself can fall over (when faced with a great big complex PDF, for example; I've seen ones that run to 3,000 pages) or just eat up all the resources on your Solr server. So we tend to run Tika as part of an external indexing process, written in Python or Java, that then throws the plain text at Solr. We can then manage it, restart it and so on.
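
If it helps, here's a very rough outline of that sort of external indexer in Python. The path to the tika-app jar, the core name and the field names are all just placeholders for illustration:

    import json
    import subprocess

    import requests

    TIKA_JAR = "/opt/tika/tika-app-1.7.jar"  # standalone Tika jar, run out-of-process
    SOLR_UPDATE = "http://localhost:8983/solr/documents/update?commit=true"

    def extract_text(path):
        # If Tika chokes on a monster PDF, only this subprocess dies, not Solr.
        raw = subprocess.check_output(
            ["java", "-jar", TIKA_JAR, "--text", path],
            timeout=300,
        )
        return raw.decode("utf-8", errors="replace")

    def index_file(path):
        doc = {"id": path, "content": extract_text(path)}
        r = requests.post(
            SOLR_UPDATE,
            data=json.dumps([doc]),
            headers={"Content-Type": "application/json"},
        )
        r.raise_for_status()

    index_file("/data/archive/report.pdf")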

There are many other ways to do this as well of course - here's some code that we wrote many moons ago which might be helpful:
https://code.google.com/p/flaxcode/source/browse/trunk/flax_filters/README

Cheers

Charlie

thanks,
Austin

On Thursday, March 12, 2015 at 4:04:29 PM UTC-5, Aaron Mefford wrote:

    Take a look at Apache Tika http://tika.apache.org/.
    It will allow you to extract the contents of the documents for
    indexing; that part is outside the scope of the Elasticsearch
    indexing itself.  A good tool to make these files downloadable is
    also out of scope, but I'll answer what is in scope.  You need to
    put the files somewhere that they can be accessed by a URL.  Any
    webserver is capable of this; of course your needs may vary, but
    this isn't the list for those questions.  Once you have a URL that
    the document can be accessed by, include that URL when you index
    the document so that you can point to it in your search results.
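
    To make that concrete, here is a rough sketch with the official
    Python client.  The index and type names, the URL, and the
    extract_with_tika helper are all made up for illustration:

        from elasticsearch import Elasticsearch

        es = Elasticsearch(["http://localhost:9200"])

        # However you run Tika, you end up with plain text for the body.
        body_text = extract_with_tika("/data/docs/report.pdf")  # hypothetical helper

        es.index(
            index="documents",
            doc_type="doc",
            id="report.pdf",
            body={
                "file_name": "report.pdf",
                "content": body_text,  # the extracted text is what gets analyzed
                # URL where the original file is served, so results can link to it
                "url": "http://files.example.com/docs/report.pdf",
            },
        )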

    I am sure there are other options out there for extracting the
    contents of Word documents, but Apache Tika is one that is
    frequently used for this purpose.

    On Thu, Mar 12, 2015 at 2:56 PM, Austin Harmon
    <[email protected]> wrote:

        Okay, so I have a large amount of data (2 TB), and it's all
        Microsoft Office documents, PDFs and emails. What is the best
        way to go about indexing the body of these documents, i.e.
        making the contents of each document searchable? I tried to
        use the PHP client but that isn't helping, and while I know
        there are ways to convert files in PHP, is there nothing
        available that takes in these types of documents? I tried the
        file_get_contents function in PHP, but it only takes in text
        documents. Also, would you know of a good tool or method to
        make the searched files downloadable?

        Thanks,
        Austin


        On Thursday, March 12, 2015 at 12:26:13 PM UTC-5,
        [email protected] wrote:

            Yes, you need to include all the text you want indexed and
            searchable as part of the JSON.

            How else would you expect Elasticsearch to receive the
            data?

            Regarding large-scale production environments, this is why
            Elasticsearch scales out.
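
            In other words, each document you send already has to carry
            the extracted text in one of its fields; roughly something
            like this (the field names are just an example):

                {
                    "file_name": "report.pdf",
                    "file_size": 284113,
                    "file_type": "pdf",
                    "content": "the full plain text extracted from the file ..."
                }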

            Aaron

            On Wednesday, March 11, 2015 at 12:50:25 PM UTC-6, Austin
            Harmon wrote:

                Hello,

                I'm trying to get an understanding of how to have
                full-text search on a document and have the body of
                the document be considered during search. I understand
                how to do the mapping and use analyzers, but what I
                don't understand is how they get the body of the
                document. If your fields are file name, file size,
                file path and file type, how do the analyzers get the
                body of the document? Surely you wouldn't have to put
                the body of every document into the JSON; that is how
                it's done in all the examples I've seen, but it
                doesn't make sense for large-scale production
                environments. If someone could please give me some
                insight into how this process works, it would be
                greatly appreciated.

                Thank you,
                Austin Harmon

--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
