Well... I think I see your issue.

I decoded this string:

L2hvbWUvYWhhcm1vbi90ZXN0L0EgUGx1cyAtIE1lZGlhIFBsYW4gU3VtbWFyeS54bHM=

It is:

/home/aharmon/test/A Plus - Media Plan Summary.xls

Another is:
/home/aharmon/test/A Plus - Summary by Venue.pdf
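You can check this yourself with a one-liner (using the first string from
your bulk file):

<?php
// Decoding the indexed string yields a file path, not file contents.
echo base64_decode('L2hvbWUvYWhhcm1vbi90ZXN0L0EgUGx1cyAtIE1lZGlhIFBsYW4gU3VtbWFyeS54bHM=');
// => /home/aharmon/test/A Plus - Media Plan Summary.xls
?>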

I think you misunderstand how this all fits together.

As I said, you must send the contents of the document to ElasticSearch for
indexing. Sending the file name is not sufficient, unless you are just
hoping to index the file name, but then why all the fuss with the Tika
extension?
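Once the real contents are indexed, a query against the attachment field
will match words inside the files. Here is a quick sketch in PHP with
ext/curl; the search term is just an example, and depending on the plugin
version the extracted text may live under "content.content" rather than
"content":

<?php
// Hypothetical check: search for a phrase you know appears inside one of
// the indexed documents.
$query = json_encode(array(
    'query' => array('match' => array('content' => 'media plan'))
));

$ch = curl_init('http://localhost:9200/historicdata/docs/_search');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $query);
echo curl_exec($ch);
curl_close($ch);
?>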

Your PHP code needs to read the full binary content of the xls, xlsx, or
PDF file, then base64 encode that full content. This will be a very large
string, about 33% larger than the original file. Base64 is used because its
character set is safe inside a JSON document, while raw binary is not.
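Here is a minimal sketch of the fix, based on your loop below. The root
path, index, and type names are taken from your example; I have also
switched the field to "content" so it matches the attachment field in your
mapping ("File" is not mapped as an attachment):

<?php

$root = '/home/aharmon/test';

$iters = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
);

foreach ($iters as $fullFileName => $iter) {
    if (!$iter->isFile()) {
        continue;
    }
    // Read the raw bytes of the document, then base64 encode them.
    // base64_encode($iter) only encodes the *path*, because SplFileInfo
    // is cast to its path string.
    $base64 = base64_encode(file_get_contents($fullFileName));

    // Write one bulk action line plus one document line per file.
    $action = json_encode(array('index' => array('_index' => 'historicdata', '_type' => 'docs')));
    $doc    = json_encode(array('content' => $base64));
    file_put_contents('/home/aharmon/data.json', $action . "\n" . $doc . "\n", FILE_APPEND);
}

?>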

With this in mind, perhaps you can see why it has been suggested that this
is not the ideal way to handle a large volume of documents. It will be more
efficient to run Tika locally, build your JSON from the extracted text,
compress that JSON, and then send it to ES.
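For example, something like this rough sketch; the tika-app jar path is
hypothetical, and --text tells Tika to print the extracted plain text:

<?php
// Extract the text locally, then index the plain text instead of a
// base64 copy of the whole file.
$tikaJar = '/opt/tika/tika-app.jar';  // assumption: adjust to your install
$file    = '/home/aharmon/test/A Plus - Media Plan Summary.xls';

$text = shell_exec('java -jar ' . escapeshellarg($tikaJar)
                 . ' --text ' . escapeshellarg($file));

// The JSON is now roughly the size of the extracted text, and it
// compresses well before being sent to ES.
$doc = json_encode(array('file' => $file, 'content' => $text));
?>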



On Fri, Mar 13, 2015 at 3:05 PM, Austin Harmon <[email protected]>
wrote:

> Hello,
>
> I'm running an instance of elasticsearch 1.3.2 on Ubuntu Server 14.04 on an
> iMac. I have the mapper-attachments plugin installed and elasticsearch gui,
> which I'm using for my front end.
>
> It's possible that I am missing something. Here are all the things I've
> tried so far:
>
> I got the mapper-attachments plugin installed.
> Then I created the index with mapping:
>
> curl -XPUT 'http://localhost:9200/historicdata' -d
> '{"mappings":{"docs":{"properties":{"content":{"type":"attachment"}}}}}'
>
> Now I use a PHP script to take the documents and convert their contents to
> base64:
>
> <?php
>
> $root = '/home/aharmon/test';
>
> $iters = new RecursiveIteratorIterator(
>     new RecursiveDirectoryIterator($root),
>     RecursiveIteratorIterator::CHILD_FIRST
> );
>
> try {
>     foreach ($iters as $fullFileName => $iter) {
>         $base64 = base64_encode($iter);
>         $indexarray = array("File" => $base64);
>         $jsonarray = json_encode($indexarray);
>         file_put_contents("/home/aharmon/data.json", $jsonarray, FILE_APPEND);
>     }
> } catch (UnexpectedValueException $e) {
>     printf("Directory [%s] contained a directory we can not recurse into", $root);
> }
>
> ?>
>
> Then I take my data.json file and use the bulk API format:
>
> {"index": {"_index": "historicdata", "_type": "docs" } }
> {"File":"L2hvbWUvYWhhcm1vbi90ZXN0L0EgUGx1cyAtIFN1bW1hcnkgYnkgVmVudWUucGRm"}
> {"index": {"_index": "historicdata", "_type": "docs" } }
>
> {"File":"L2hvbWUvYWhhcm1vbi90ZXN0L0EgUGx1cyAtIE1lZGlhIFBsYW4gU3VtbWFyeS54bHM="}
> {"_index": "historicdata", "_type": "docs" } }
> {"File":"L2hvbWUvYWhhcm1vbi90ZXN0L0EgUGx1cyAtIFN1bW1hcnkgYnkgVmVudWUueGxz"}
> {"_index": "historicdata", "_type": "docs" } }
> {"File":"L2hvbWUvYWhhcm1vbi90ZXN0L0FnZW5jaWVzIE1hc3RlciBMaXN0Lnhsc3g="}
>
> This is in a separate folder called bulk-requests
>
> Then I run this command:
>
> curl -s -XPOST localhost:9200/_bulk --data-binary @bulk-requests; echo
>
> I got a success message back, so it is all indexed.
>
> Then I run this command:
>
> curl -XGET 'http://localhost:9200/historicdata/docs/_search' -d '{"fields":
> [ "content.content_type" ], "query":{"match":{"content.content_type":"text
> plain"}}}'
>
> {"took":2,
> "timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},
> "hits":{"total":2,"max_score":1.0,"hits":[{"_index":"historicdata","_type":"docs","_id":"LMkqzKbyWTGffNtr1mGPZA","_score":1.0,"_source":{"File":"L2hvbWUvYWhhcm1vbi9-ZXN0L)EgUGx1cyAtIFN1bW1hcnkgYnkgVmVudWUucGRM"}},
> {"_index":"historicdata","_type":"docs","_id":"GBEIWECwRgiUbYB6pnq7dQ","_score":1.0,"_source":{"File":"L2hvbWUvYWhhcm1vbi90ZXN0L0EgUGx1cyAtIE1lZGlhIFBsYW4gU3VtbWFyeS54bHM="}
> }]}}
>
> So it is indexing the documents and the search works, but the contents
> aren't being decoded from base64. Maybe there is a general rule with base64
> that I don't know about? I have followed the documentation religiously on
> GitHub and Elasticsearch's site. Also, when I decode the base64 within the
> PHP script before I put it into the JSON array, it all comes out null. These
> are .xlsx, .xls, and .pdf documents.
>
> Thanks for your help, guys. It is greatly appreciated.
>
> Let me know if you need any more information than what I have provided.
