Re: Analyzers and JSON

David Pilato Fri, 13 Mar 2015 12:46:27 -0700

Thanks. I missed the post.
Will answer there.

-- 
David Pilato - Developer | Evangelist 
Elasticsearch.com
@dadoonet <https://twitter.com/dadoonet> | @elasticsearchfr 
<https://twitter.com/elasticsearchfr> | @scrutmydocs 
<https://twitter.com/scrutmydocs>





> On 13 March 2015 at 12:41, Aaron Mefford <[email protected]> wrote:
> 
> He posted limited details in a separate thread.
> 
> "mapper-attachment and base64 encoding"
> 
> I was not asserting that it does not work, just that it may not be the best 
> way to handle "large number of documents".
> 
> I suspect there is an issue with encoding or submitting the document.
> 
> 
> 
> 
> On Fri, Mar 13, 2015 at 1:35 PM, David Pilato <[email protected]> wrote:
> I'm a bit concerned about your "it does not work" statement.
> We currently have only 4 open issues on it:
> https://github.com/elastic/elasticsearch-mapper-attachments/issues
> 1 bug and 3 feature requests.
> 
> Could you explain a bit more about what is not working? Maybe I missed something.
> 
> 
> 
> 
> 
> 
> 
>> On 13 March 2015 at 10:49, Austin Harmon <[email protected]> wrote:
>> 
>> There is a plugin called mapper attachments:
>> https://github.com/elastic/elasticsearch-mapper-attachments
>> This plugin is supposed to use Tika to index the content of documents, but
>> it doesn't seem to be working correctly. I base64-encode the documents, but
>> it comes back as null when I decode it.
>> 
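
One way to rule out the encoding step as the cause of a null result is to check that the base64 string round-trips before it is submitted. The sketch below is a minimal illustration in Python and is not part of the plugin: the helper names and the `file` field name are hypothetical, and it assumes the index mapping declares that field with the attachment type.

```python
import base64
import json


def encode_attachment(raw: bytes) -> str:
    """Return the base64 text form of raw file bytes."""
    # b64encode returns bytes; JSON needs a str, so decode as ASCII.
    return base64.b64encode(raw).decode("ascii")


def build_attachment_body(raw: bytes, field: str = "file") -> str:
    """Build a JSON request body carrying the encoded file.

    `field` is a hypothetical name; it must match whatever field the
    index mapping declares as an attachment.
    """
    encoded = encode_attachment(raw)
    # Round-trip check: decoding must give back the original bytes.
    if base64.b64decode(encoded) != raw:
        raise ValueError("base64 round-trip failed")
    return json.dumps({field: encoded})


if __name__ == "__main__":
    sample = b"%PDF-1.4 example bytes"  # stand-in for real file contents
    print(build_attachment_body(sample))
```

If the round-trip check passes but the indexed content is still empty, the problem is more likely in how the request is submitted or in the mapping than in the encoding itself.
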
>> On Friday, March 13, 2015 at 11:38:38 AM UTC-5, Aaron Mefford wrote:
>> Not certain what you are referring to, so I expect not. I have used the
>> Elasticsearch mappings, but I can't see how those would directly integrate
>> with Tika.
>> 
>> On Fri, Mar 13, 2015 at 10:35 AM, Austin Harmon <[email protected]> wrote:
>> Thank you for the information. This is going to be very difficult, I can tell.
>> Do you have experience with the mapper attachment?
>> 
>> On Friday, March 13, 2015 at 11:15:18 AM UTC-5, Aaron Mefford wrote:
>> You're going to have the same issue with Solr, putting the contents into XML,
>> which is even heavier than JSON.
>> 
>> I wish I had more experience using Tika, but I do not. I am aware of its
>> capabilities but have not had reason to use it myself.
>> 
>> I see what you are saying about others not having the same issue, but what 
>> you must realize is that most users are not indexing that type of document.  
>> They are indexing events, database records, web pages and so on.  It is a 
>> very small subset that index things like word docs and pdfs.
>> 
>> On Fri, Mar 13, 2015 at 9:42 AM, Austin Harmon <[email protected]> wrote:
>> Thank you for the information. I've been trying to use the mapper attachment 
>> which has Apache Tika built into it. I am just surprised and confused that 
>> so many companies use Elasticsearch, yet it is so difficult to index the
>> contents of a document. If I need to index the contents of documents then 
>> would it be easier and more efficient to switch over to Apache Solr? As I 
>> said I have 2TB of data so it isn't efficient for me to manually input each 
>> document so it can be indexed with specific JSON. If you have any experience 
>> with Solr please let me know if it would be a good solution to my problem. 
>> 
>> thanks,
>> Austin
>> 
>> On Thursday, March 12, 2015 at 4:04:29 PM UTC-5, Aaron Mefford wrote:
>> Take a look at Apache Tika: http://tika.apache.org/
>> It will allow you to extract the contents of the documents for indexing;
>> this is outside the scope of Elasticsearch indexing. A good tool to
>> make these files downloadable is also out of scope, but I'll answer what
>> is in scope. You need to put the files somewhere they can be accessed
>> by a URL. Any web server is capable of this. Of course, your needs may vary,
>> but this isn't the list for those questions. Once you have a URL by which the
>> document can be accessed, include that in your indexing of the document
>> so that you can point to that URL in your search results.
>> 
>> I am sure there are other options out there for extracting the contents of
>> Word documents; Apache Tika is one that is frequently used for this purpose,
>> though.
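
The approach described above (extract the text outside Elasticsearch, then index it together with a URL to the original file) can be sketched as follows. This is a minimal illustration in Python under stated assumptions: the text is presumed to have been extracted already, for example by Tika, and the function name, field names, and base URL are all hypothetical.

```python
import json
from pathlib import PurePosixPath
from urllib.parse import quote


def build_index_document(relative_path: str, extracted_text: str,
                         base_url: str) -> dict:
    """Build the JSON-ready document to index for one file.

    `extracted_text` is assumed to come from a separate extraction
    step; this function does not read or parse the file itself.
    """
    path = PurePosixPath(relative_path)
    return {
        "file_name": path.name,
        "file_type": path.suffix.lstrip(".").lower(),
        # The searchable body of the document.
        "content": extracted_text,
        # Where the web server exposes the original file for download.
        # quote() escapes spaces and similar characters but keeps "/".
        "url": base_url.rstrip("/") + "/" + quote(str(path)),
    }


if __name__ == "__main__":
    doc = build_index_document(
        "reports/Q1 summary.docx",
        "Quarterly results",
        "http://files.example.com/docs/",  # hypothetical web server
    )
    print(json.dumps(doc, indent=2))
```

Search results can then show the `url` field as a download link, while the original files stay on the web server rather than in the index.
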
>> 
>> On Thu, Mar 12, 2015 at 2:56 PM, Austin Harmon <[email protected]> wrote:
>> Okay, so I have a large amount of data (2 TB), and it's all Microsoft Office
>> documents, PDFs, and emails. What is the best way to go about indexing the
>> body of these documents so that their contents are searchable? I
>> tried to use the PHP client, but that isn't helping. I know there are ways
>> to convert files in PHP, but is there nothing available that takes in these
>> types of documents? I tried the file_get_contents function in PHP, but it
>> only takes in text documents. Also, would you know of a good tool or method
>> to make the files that are searched downloadable?
>> 
>> Thanks,
>> Austin
>> 
>> 
>> On Thursday, March 12, 2015 at 12:26:13 PM UTC-5, [email protected] wrote:
>> Yes, you need to include all the text you want indexed and searchable as part
>> of the JSON.
>> 
>> How else would you expect Elasticsearch to receive the data?
>> 
>> Regarding large-scale production environments, this is why Elasticsearch
>> scales out.
>> 
>> Aaron
>> 
>> On Wednesday, March 11, 2015 at 12:50:25 PM UTC-6, Austin Harmon wrote:
>> Hello,
>> 
>> I'm trying to get an understanding of how to have full-text search on the
>> document and have the body of the document be considered during search. I
>> understand how to do the mapping and use analyzers, but what I don't
>> understand is how they get the body of the document. If your fields are file
>> name, file size, file path, and file type, how do the analyzers get the body of
>> the document? Surely you wouldn't have to put the body of every document
>> into the JSON. That is how I've seen it done in all the examples I've seen,
>> but that doesn't make sense for large-scale production environments. If
>> someone could please give me some insight as to how this process works, it
>> would be greatly appreciated.
>> 
>> Thank you,
>> Austin Harmon
>> 
>> -- 
>> You received this message because you are subscribed to a topic in the 
>> Google Groups "elasticsearch" group.
>> To unsubscribe from this topic, visit 
>> https://groups.google.com/d/topic/elasticsearch/mG2k23vbzXQ/unsubscribe 
>> <https://groups.google.com/d/topic/elasticsearch/mG2k23vbzXQ/unsubscribe>.
>> To unsubscribe from this group and all its topics, send an email to 
>> [email protected].
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/elasticsearch/41516b36-18e3-4ef8-8d8d-1e9da6b727a4%40googlegroups.com
>>  
>> <https://groups.google.com/d/msgid/elasticsearch/41516b36-18e3-4ef8-8d8d-1e9da6b727a4%40googlegroups.com?utm_medium=email&utm_source=footer>.
>> 
>> For more options, visit https://groups.google.com/d/optout 
>> <https://groups.google.com/d/optout>.
>> 
>> 
> 
> 

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/5F337472-7F1B-462F-A9A2-A617D6F4536A%40pilato.fr.
For more options, visit https://groups.google.com/d/optout.
