He posted limited details in a separate thread.

"mapper-attachment and base64 encoding"

I was not asserting that it does not work, just that it may not be the best
way to handle "large number of documents".

I suspect there is an issue with encoding or submitting the document.
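
To rule out the encoding side, a quick round-trip check along these lines may help. This is only a sketch in Python using the standard library (the thread mentions PHP, so treat it as an illustration of the idea rather than the client code in question). The field name "file" and the sample bytes are hypothetical placeholders, not something taken from the plugin documentation or from the original poster's setup.

```python
import base64
import json


def encode_attachment(raw: bytes) -> str:
    """Return raw file bytes as an ASCII base64 string that is safe to put in JSON."""
    return base64.b64encode(raw).decode("ascii")


def build_body(raw: bytes, field: str = "file") -> str:
    """Build a JSON request body holding the encoded bytes.

    "file" is a hypothetical field name; use whatever field your mapping
    declares for the attachment.
    """
    return json.dumps({field: encode_attachment(raw)})


def round_trips(raw: bytes, field: str = "file") -> bool:
    """Decode the JSON body again and confirm the original bytes come back."""
    body = build_body(raw, field)
    decoded = base64.b64decode(json.loads(body)[field])
    return decoded == raw


if __name__ == "__main__":
    # Hypothetical sample bytes; in practice read the file in binary mode,
    # e.g. open(path, "rb").read(), so nothing is lost to text decoding.
    sample = b"%PDF-1.4 sample bytes \x00\xff"
    print(round_trips(sample))
```

If the round trip fails, or the encoded string is empty, the problem is on the client side before Elasticsearch ever sees the document. If it succeeds, the next thing to look at is how the request is submitted.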




On Fri, Mar 13, 2015 at 1:35 PM, David Pilato <[email protected]> wrote:

> I’m a bit concerned about your "it does not work" statement.
> As of today we have only 4 open issues on it:
> https://github.com/elastic/elasticsearch-mapper-attachments/issues
> 1 bug and 3 feature requests.
>
> Could you explain a bit more what is not working? Maybe I missed
> something.
>
>
>
> --
> *David Pilato* - Developer | Evangelist
> *Elasticsearch.com <http://Elasticsearch.com>*
> @dadoonet <https://twitter.com/dadoonet> | @elasticsearchfr
> <https://twitter.com/elasticsearchfr> | @scrutmydocs
> <https://twitter.com/scrutmydocs>
>
>
>
>
> On 13 March 2015 at 10:49, Austin Harmon <[email protected]> wrote:
>
> There is a plugin called mapper attachments:
> https://github.com/elastic/elasticsearch-mapper-attachments This plugin
> is supposed to use Tika to index the content of documents but it doesn't
> seem to be working correctly. I base64 encode the documents but it comes
> back as null when I decode it.
>
> On Friday, March 13, 2015 at 11:38:38 AM UTC-5, Aaron Mefford wrote:
>>
>> Not certain what you are referring to so I expect not. I have used the
>> Elasticsearch mappings, but I can't see how those would directly integrate
>> with Tika.
>>
>> On Fri, Mar 13, 2015 at 10:35 AM, Austin Harmon <[email protected]>
>> wrote:
>>
>>> Thank you for the information. This is going to be very difficult, I can
>>> tell. Do you have experience with the mapper attachment?
>>>
>>> On Friday, March 13, 2015 at 11:15:18 AM UTC-5, Aaron Mefford wrote:
>>>>
>>>> You're going to have the same issue with Solr, putting the contents into
>>>> XML, which is even heavier than JSON.
>>>>
>>>> I wish I had more experience using Tika, but I do not. I am
>>>> aware of its capabilities but have not had reason to use it myself.
>>>>
>>>> I see what you are saying about others not having the same issue, but
>>>> what you must realize is that most users are not indexing that type of
>>>> document.  They are indexing events, database records, web pages and so
>>>> on. It is a very small subset that indexes things like Word docs and PDFs.
>>>>
>>>> On Fri, Mar 13, 2015 at 9:42 AM, Austin Harmon <[email protected]>
>>>> wrote:
>>>>
>>>>> Thank you for the information. I've been trying to use the mapper
>>>>> attachment, which has Apache Tika built into it. I am just surprised and
>>>>> confused that so many companies use Elasticsearch and yet it is so
>>>>> difficult to index the contents of a document. If I need to index the
>>>>> contents of documents, would it be easier and more efficient to switch
>>>>> over to Apache Solr? As I said, I have 2 TB of data, so it isn't efficient
>>>>> for me to manually input each document so it can be indexed with specific
>>>>> JSON. If you have any experience with Solr, please let me know if it would
>>>>> be a good solution to my problem.
>>>>>
>>>>> thanks,
>>>>> Austin
>>>>>
>>>>> On Thursday, March 12, 2015 at 4:04:29 PM UTC-5, Aaron Mefford wrote:
>>>>>>
>>>>>> Take a look at Apache Tika: http://tika.apache.org/
>>>>>> It will allow you to extract the contents of the documents for indexing;
>>>>>> this is outside the scope of Elasticsearch indexing. A good tool to
>>>>>> make these files downloadable is also out of scope, but I'll answer what
>>>>>> is in scope. You need to put the files somewhere that they can be
>>>>>> accessed by a URL. Any web server is capable of this; of course your needs
>>>>>> may vary, but this isn't the list for those questions. Once you have a URL
>>>>>> that the document can be accessed by, include that in your indexing of the
>>>>>> document so that you can point to that URL in your search results.
>>>>>>
>>>>>> I am sure there are other options out there for extracting the
>>>>>> contents of Word documents; Apache Tika is one that is frequently used
>>>>>> for this purpose, though.
>>>>>>
>>>>>> On Thu, Mar 12, 2015 at 2:56 PM, Austin Harmon <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Okay, so I have a large amount of data (2 TB), and it's all Microsoft
>>>>>>> Office documents, PDFs, and emails. What is the best way to go about
>>>>>>> indexing the body of these documents so that their contents are
>>>>>>> searchable? I tried to use the PHP client but that isn't helping. I know
>>>>>>> there are ways to convert files in PHP, but is there nothing available
>>>>>>> that takes in these types of documents? I tried the file_get_contents
>>>>>>> function in PHP but it only takes in text documents. Also, would you know
>>>>>>> of a good tool or method to make the files that are searched downloadable?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Austin
>>>>>>>
>>>>>>>
>>>>>>> On Thursday, March 12, 2015 at 12:26:13 PM UTC-5, [email protected]
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Yes you need to include all the text you want indexed and
>>>>>>>> searchable as part of the JSON.
>>>>>>>>
>>>>>>>> How else would you expect ElasticSearch to receive the data?
>>>>>>>>
>>>>>>>> Regarding large scale production environments, this is why
>>>>>>>> ElasticSearch scales out.
>>>>>>>>
>>>>>>>> Aaron
>>>>>>>>
>>>>>>>> On Wednesday, March 11, 2015 at 12:50:25 PM UTC-6, Austin Harmon
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I'm trying to get an understanding of how to have full-text
>>>>>>>>> search on the document and have the body of the document be considered
>>>>>>>>> during search. I understand how to do the mapping and use analyzers, but
>>>>>>>>> what I don't understand is how they get the body of the document. If your
>>>>>>>>> fields are file name, file size, file path, and file type, how do the
>>>>>>>>> analyzers get the body of the document? Surely you wouldn't have to put
>>>>>>>>> the body of every document into the JSON. That is how I've seen it done
>>>>>>>>> in all the examples, but that doesn't make sense for large-scale
>>>>>>>>> production environments. If someone could please give me some insight as
>>>>>>>>> to how this process works, it would be greatly appreciated.
>>>>>>>>>
>>>>>>>>> Thank you,
>>>>>>>>> Austin Harmon
>>>>>>>>>
>>>>>>>>
>>>>>>> --
>>>>>>> You received this message because you are subscribed to a topic in
>>>>>>> the Google Groups "elasticsearch" group.
>>>>>>> To unsubscribe from this topic, visit
>>>>>>> https://groups.google.com/d/topic/elasticsearch/mG2k23vbzXQ/unsubscribe.
>>>>>>> To unsubscribe from this group and all its topics, send an email to
>>>>>>> [email protected].
>>>>>>> To view this discussion on the web visit
>>>>>>> https://groups.google.com/d/msgid/elasticsearch/41516b36-18e3-4ef8-8d8d-1e9da6b727a4%40googlegroups.com.
>>>>>>>
>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAF9vEEqxZ9XPD8aB0jg3xartJEUW6NAKQGe%3D7Z8_udHsUQDijA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
