Hi,
I have a complex json data stored in Mongo.

The data has nested arrays as shown below ****.

here when I try to access the geometry field in select queries, the Drill
query fails.

 select camic._id, camic.type, camic.parent_id, camic.randval,
camic.creation_date, camic.object_type, camic.x, camic.y, camic.normalized,
camic.bbox, camic.geometry, camic.footprint, camic.properties,
camic.provenance, camic.submit_date from mongo.CAMICROSCOPE.`testUAIM2` as
camic WHERE ((camic.provenance.image.case_id = 'TCGA-02-0001-01Z-00-DX1')
AND (camic.provenance.analysis.execution_id  = 'tammy-test:7') AND
(camic.footprint >= 800) AND (camic.x >= 0) AND (camic.x <=1) AND (camic.y
>= 0) AND (camic.y <= 1));

would fail with a below error:





*Error: SYSTEM ERROR: NullPointerExceptionFragment 0:0[Error Id:
8cd950af-91fa-4cf8-865b-265f227c8e87 on llovizna:31010] (state=,code=0)*

However, after removing the geometry from the select query, it would work
just fine.

 select camic._id, camic.type, camic.parent_id, camic.randval,
camic.creation_date, camic.object_type, camic.x, camic.y, camic.normalized,
camic.bbox, camic.footprint, camic.properties, camic.provenance,
camic.submit_date from mongo.CAMICROSCOPE.`testUAIM2` as camic WHERE
((camic.provenance.image.case_id = 'TCGA-02-0001-01Z-00-DX1') AND
(camic.provenance.analysis.execution_id  = 'tammy-test:7') AND
(camic.footprint >= 800) AND (camic.x >= 0) AND (camic.x <=1) AND (camic.y
>= 0) AND (camic.y <= 1));


I know this is due to the array in the output. I am also aware of the
commonly suggested options.
1. Using the array indexes in the select query: This is *impractical*. I do
not know how many elements I would have in this geojson - the coordinates.
It may be millions or as low as 3.

2. Flatten keyword: I am using Drill on top of Mongo - and finding an
interesting case where Drill outperforms certain queries in a distributed
execution than just using Mongo. Using Flatten basically kills all the
performance benefits I have with Drill otherwise. Flatten is just plain
*expensive* operation for the scale of my data (around 48 GB. But I can
split them into a few GB each).

Now given that I have explained why I cannot use the commonly suggested
options to deal with the complex Drill queries involving arrays, what are
the other alternatives?

I am thinking of using Protobufs with Drill to serialize the nested arrays.
Has anyone else tried it before, and any pointers, or anyother suggestions
on this overall requirement of querying the nested arrays?

Thank you.
Regards,
Pradeeban.


**** The sample data in Mongo (data anonymized).
> db.testUAIM2.findOne()
{
    "_id" : ObjectId("22"),
    "bbox" : [
        0.11,
        0.33,
        0.44,
        0.44
    ],
    "*geometry*" : {
        "type" : "Polygon",
        "coordinates" : [
            [
                [
                    0.04272915795445442,
                    0.9368849396705627
                ],
                [
                    0.04272915795445442,
                    0.9369083046913147
                ],
                [
                    0.042739588767290115,
                    0.9369083046913147
                ],
                [
                    0.042739588767290115,
                    0.9369550347328186
                ],
                [
                    0.04275001958012581,
                    0.9369550347328186
                ],
                [
                    0.04275001958012581,
                    0.9370251893997192
                ],
                [
                    0.042760446667671204,
                    0.9370251893997192
                ],
                [
                    0.042760446667671204,
                    0.9371420741081238
                ],
                [
                    0.04275001958012581,
                    0.9371420741081238
                ],
                [
                    0.04275001958012581,
                    0.9372121691703796
                ],
                [
                    0.042739588767290115,
                    0.9372121691703796
                ],
                [
                    0.042739588767290115,
                    0.9372823238372803
                ],
                [
                    0.04272915795445442,
                    0.9372823238372803
                ],
                [
                    0.04272915795445442,
                    0.9373056888580322
                ],
                [
                    0.04272915795445442,
                    0.9368849396705627
                ]
            ]
        ]
    },
    "footprint" : 5,
    "properties" : {
        "scalar_features" : [
            {
                "nv" : [
                    {
                        "name" : "f",
                        "value" : 16.7036895751953
                    }
                ]
            }
        ]
    }}


-- 
Pradeeban Kathiravelu.
PhD Researcher, Erasmus Mundus Joint Doctorate in Distributed Computing,
INESC-ID Lisboa / Instituto Superior Técnico, Universidade de Lisboa,
Portugal.
Biomedical Informatics Software Engineer, Emory University School of
Medicine.

Blog: [Llovizna] http://kkpradeeban.blogspot.com/
LinkedIn: www.linkedin.com/pub/kathiravelu-pradeeban/12/b6a/b03

Reply via email to