Json and Drill

Sudheesh Katkam Tue, 13 Sep 2016 17:34:49 -0700

Hi Pradeeban,

Can you post the detailed error message?


First set the option:
> SET `exec.errors.verbose` = true;

And then run the query. The detailed output will point us to where the error 
occurred.

Thank you,
Sudheesh

> On Sep 12, 2016, at 10:47 AM, Pradeeban Kathiravelu <[email protected]> 
> wrote:
> 
> Hi,
> I have a complex json data stored in Mongo.
> 
> The data has nested arrays as shown below ****.
> 
> here when I try to access the geometry field in select queries, the Drill
> query fails.
> 
> select camic._id, camic.type, camic.parent_id, camic.randval,
> camic.creation_date, camic.object_type, camic.x, camic.y, camic.normalized,
> camic.bbox, camic.geometry, camic.footprint, camic.properties,
> camic.provenance, camic.submit_date from mongo.CAMICROSCOPE.`testUAIM2` as
> camic WHERE ((camic.provenance.image.case_id = 'TCGA-02-0001-01Z-00-DX1')
> AND (camic.provenance.analysis.execution_id  = 'tammy-test:7') AND
> (camic.footprint >= 800) AND (camic.x >= 0) AND (camic.x <=1) AND (camic.y
>> = 0) AND (camic.y <= 1));
> 
> would fail with a below error:
> 
> 
> 
> 
> 
> *Error: SYSTEM ERROR: NullPointerExceptionFragment 0:0[Error Id:
> 8cd950af-91fa-4cf8-865b-265f227c8e87 on llovizna:31010] (state=,code=0)*
> 
> However, after removing the geometry from the select query, it would work
> just fine.
> 
> select camic._id, camic.type, camic.parent_id, camic.randval,
> camic.creation_date, camic.object_type, camic.x, camic.y, camic.normalized,
> camic.bbox, camic.footprint, camic.properties, camic.provenance,
> camic.submit_date from mongo.CAMICROSCOPE.`testUAIM2` as camic WHERE
> ((camic.provenance.image.case_id = 'TCGA-02-0001-01Z-00-DX1') AND
> (camic.provenance.analysis.execution_id  = 'tammy-test:7') AND
> (camic.footprint >= 800) AND (camic.x >= 0) AND (camic.x <=1) AND (camic.y
>> = 0) AND (camic.y <= 1));
> 
> 
> I know this is due to the array in the output. I am also aware of the
> commonly suggested options.
> 1. Using the array indexes in the select query: This is *impractical*. I do
> not know how many elements I would have in this geojson - the coordinates.
> It may be millions or as low as 3.
> 
> 2. Flatten keyword: I am using Drill on top of Mongo - and finding an
> interesting case where Drill outperforms certain queries in a distributed
> execution than just using Mongo. Using Flatten basically kills all the
> performance benefits I have with Drill otherwise. Flatten is just plain
> *expensive* operation for the scale of my data (around 48 GB. But I can
> split them into a few GB each).
> 
> Now given that I have explained why I cannot use the commonly suggested
> options to deal with the complex Drill queries involving arrays, what are
> the other alternatives?
> 
> I am thinking of using Protobufs with Drill to serialize the nested arrays.
> Has anyone else tried it before, and any pointers, or anyother suggestions
> on this overall requirement of querying the nested arrays?
> 
> Thank you.
> Regards,
> Pradeeban.
> 
> 
> **** The sample data in Mongo (data anonymized).
>> db.testUAIM2.findOne()
> {
>    "_id" : ObjectId("22"),
>    "bbox" : [
>        0.11,
>        0.33,
>        0.44,
>        0.44
>    ],
>    "*geometry*" : {
>        "type" : "Polygon",
>        "coordinates" : [
>            [
>                [
>                    0.04272915795445442,
>                    0.9368849396705627
>                ],
>                [
>                    0.04272915795445442,
>                    0.9369083046913147
>                ],
>                [
>                    0.042739588767290115,
>                    0.9369083046913147
>                ],
>                [
>                    0.042739588767290115,
>                    0.9369550347328186
>                ],
>                [
>                    0.04275001958012581,
>                    0.9369550347328186
>                ],
>                [
>                    0.04275001958012581,
>                    0.9370251893997192
>                ],
>                [
>                    0.042760446667671204,
>                    0.9370251893997192
>                ],
>                [
>                    0.042760446667671204,
>                    0.9371420741081238
>                ],
>                [
>                    0.04275001958012581,
>                    0.9371420741081238
>                ],
>                [
>                    0.04275001958012581,
>                    0.9372121691703796
>                ],
>                [
>                    0.042739588767290115,
>                    0.9372121691703796
>                ],
>                [
>                    0.042739588767290115,
>                    0.9372823238372803
>                ],
>                [
>                    0.04272915795445442,
>                    0.9372823238372803
>                ],
>                [
>                    0.04272915795445442,
>                    0.9373056888580322
>                ],
>                [
>                    0.04272915795445442,
>                    0.9368849396705627
>                ]
>            ]
>        ]
>    },
>    "footprint" : 5,
>    "properties" : {
>        "scalar_features" : [
>            {
>                "nv" : [
>                    {
>                        "name" : "f",
>                        "value" : 16.7036895751953
>                    }
>                ]
>            }
>        ]
>    }}
> 
> 
> -- 
> Pradeeban Kathiravelu.
> PhD Researcher, Erasmus Mundus Joint Doctorate in Distributed Computing,
> INESC-ID Lisboa / Instituto Superior Técnico, Universidade de Lisboa,
> Portugal.
> Biomedical Informatics Software Engineer, Emory University School of
> Medicine.
> 
> Blog: [Llovizna] http://kkpradeeban.blogspot.com/
> LinkedIn: www.linkedin.com/pub/kathiravelu-pradeeban/12/b6a/b03

Re: Nested/n-dimensional Arrays in Mongo/Json and Drill

Reply via email to