This is a known issue, we have been describing this schema change scenario
as a change in "data shape". Here is a very simple dataset that will
produce the same error
{ "a" : 1 }
{ "a" : { "b" : 3} }
We currently only use all_text_mode to change the type of the data at a
leaf in the schema.
There is an argument that we could have implemented all_text_mode to read
this file and transform this data into this representation, where complex
data is stored in text blobs as JSON:
{ "a" : "1" }
{ "a" : "{\"b\" : 3}" }
Unfortunately, if this came out of the read, we would need to force users
to use the covert_from(col, 'JSON') function to access the fields that are
nested below the root level of the schema.
We decided not to do this at the time all_text_mode was implemented. I do
not believe we want to provide a workaround this right now as the work on
embedded type, which will enable native support for all kinds of schema
changes is ongoing.
https://issues.apache.org/jira/browse/DRILL-3228
On Thu, Sep 10, 2015 at 11:46 AM, Khurram Faraaz <[email protected]>
wrote:
> Hi,
>
>
> Querying nested data from a JSON file with and with out setting
> store.json.all_text_mode
> results in errors.
>
>
> Data that was used in the test is available here -
>
> https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5
>
> There is an option (on the top right) to Export as JSON.
>
> Drill commit ID : 0686bc23 (on master)
>
>
> [root@centos-01 ~]# hadoop fs -ls /tmp/rows.json
>
> -rwxr-xr-x 3 root root 156417523 2015-09-10 17:18 /tmp/rows.json
>
>
> [root@centos-01 food_inspection_data]# hadoop fs -put rows.json /tmp
>
> [root@centos-01 food_inspection_data]# cd /opt/mapr/drill/drill-1.2.0/bin
>
> [root@centos-01 bin]# ./sqlline -u "jdbc:drill:schema=dfs.tmp"
>
> apache drill 1.0.0
>
> "a little sql for your nosql"
>
>
> Query 1)
>
>
> 0: jdbc:drill:schema=dfs.tmp> select * from `rows.json`;
>
> Error: DATA_READ ERROR: You tried to start when you are using a ValueWriter
> of type NullableVarCharWriterImpl.
>
>
> File /tmp/rows.json
>
> Record 1
>
> Line 1023
>
> Column 29
>
> Field smallest
>
> Fragment 0:0
>
>
> [Error Id: 0844933a-ed61-4137-9970-203576d5770d on centos-02.qa.lab:31010]
> (state=,code=0)
>
>
> *-----*
>
>
> Query 2) Drill seems to be checking for correctness of data much deeper in
> the nesting. Note that I do a count over field* meta, *and field*
> "smallest" *reported in error is much deeper inside* meta, *
> meta.view.columns.cachedContents.smallest
>
>
> 0: jdbc:drill:schema=dfs.tmp> select count(meta) from `rows.json`;
>
> Error: DATA_READ ERROR: You tried to start when you are using a ValueWriter
> of type NullableVarCharWriterImpl.
>
>
> File /tmp/rows.json
>
> Record 1
>
> Line 1023
>
> Column 29
>
> Field smallest
>
> Fragment 0:0
>
>
> [Error Id: 66e7f461-6671-4862-bde5-685c8c2304c9 on centos-04.qa.lab:31010]
> (state=,code=0)
>
>
> Explain plan for Query (2)
>
>
> 0: jdbc:drill:schema=dfs.tmp> explain plan for select count(meta) from
> `rows.json`;
>
> *+------+------+*
>
> *| **text** | **json** |*
>
> *+------+------+*
>
> *| *00-00 Screen
>
> 00-01 Project(EXPR$0=[$0])
>
> 00-02 StreamAgg(group=[{}], EXPR$0=[$SUM0($0)])
>
> 00-03 StreamAgg(group=[{}], EXPR$0=[COUNT($0)])
>
> 00-04 Scan(groupscan=[EasyGroupScan
> [selectionRoot=maprfs:/tmp/rows.json, numFiles=1, columns=[`meta`],
> files=[maprfs:///tmp/rows.json]]])
>
>
> Query 3)
>
>
> 0: jdbc:drill:schema=dfs.tmp> select count(t.meta.view) from `rows.json` t;
>
> Error: DATA_READ ERROR: You tried to start when you are using a ValueWriter
> of type NullableVarCharWriterImpl.
>
>
> File /tmp/rows.json
>
> Record 1
>
> Line 1023
>
> Column 29
>
> Field smallest
>
> Fragment 0:0
>
>
> [Error Id: 55ea650c-9587-42d8-aec9-0f5a4b77ee4c on centos-04.qa.lab:31010]
> (state=,code=0)
>
>
> Query 4) - The count returned is wrong!
>
>
> 0: jdbc:drill:schema=dfs.tmp> select count(t.meta.view.id) from
> `rows.json`
> t;
>
> *+---------+*
>
> *| **EXPR$0 ** |*
>
> *+---------+*
>
> *| *1 * |*
>
> *+---------+*
>
> 1 row selected (1.57 seconds)
>
> -------------------
>
> The below queries were run after json all text mode configuration was
> enabled.
>
> 0: jdbc:drill:schema=dfs.tmp> alter session set
> `store.json.all_text_mode`=true;
>
> *+-------+------------------------------------+*
>
> *| ** ok ** | ** summary ** |*
>
> *+-------+------------------------------------+*
>
> *| *true * | *store.json.all_text_mode updated. * |*
>
> *+-------+------------------------------------+*
>
> 1 row selected (0.218 seconds)
>
>
> 0: jdbc:drill:schema=dfs.tmp> select count(t.meta.view) from `rows.json` t;
>
> Error: DATA_READ ERROR: Error parsing JSON - You tried to start when you
> are using a ValueWriter of type NullableVarCharWriterImpl.
>
>
> File /tmp/rows.json
>
> Record 1
>
> Fragment 0:0
>
>
> [Error Id: 5eeee0f3-69fd-445c-a149-a3aa2f18d1cb on centos-04.qa.lab:31010]
> (state=,code=0)
>
>
> COUNT returned in below query is wrong.
>
>
> 0: jdbc:drill:schema=dfs.tmp> select count(t.meta.view.id) from
> `rows.json`
> t;
>
> +---------+
>
> | EXPR$0 |
>
> +---------+
>
> | 1 |
>
> +---------+
>
> 1 row selected (0.948 seconds)
>
>
> Query to perform count over nested data.
>
>
> 0: jdbc:drill:schema=dfs.tmp> select
> count(t.meta.view.columns.cachedContents.smallest) from `rows.json` t;
>
> Error: DATA_READ ERROR: Error parsing JSON - You tried to start when you
> are using a ValueWriter of type NullableVarCharWriterImpl.
>
>
> File /tmp/rows.json
>
> Record 1
>
> Fragment 0:0
>
>
> [Error Id: 4ce0d5fe-abd7-47d9-8106-269fe37409e6 on centos-04.qa.lab:31010]
> (state=,code=0)
>
>
> 0: jdbc:drill:schema=dfs.tmp> select
> t.meta.view.columns.cachedContents.smallest from `rows.json` t;
>
> Error: DATA_READ ERROR: Error parsing JSON - You tried to start when you
> are using a ValueWriter of type NullableVarCharWriterImpl.
>
>
> File /tmp/rows.json
>
> Record 1
>
> Fragment 0:0
>
>
> [Error Id: b2514414-af51-46c5-938d-ff792e6a8a8c on centos-04.qa.lab:31010]
> (state=,code=0)
>
> 0: jdbc:drill:schema=dfs.tmp> select * from `rows.json`;
>
> Error: DATA_READ ERROR: Error parsing JSON - You tried to start when you
> are using a ValueWriter of type NullableVarCharWriterImpl.
>
>
> File /tmp/rows.json
>
> Record 1
>
> Fragment 0:0
>
>
> [Error Id: 3615e088-64cc-4cba-8cfe-602563f4e16c on centos-04.qa.lab:31010]
> (state=,code=0)
>
> These are the occorrences of field named *smallest* in the JSON file
>
> Administrators-MacBook-Pro:Downloads kfaraaz$ grep "smallest" rows.json
>
> "smallest" : "44247",
>
> "smallest" : "'C' A CATERING AND EVENT COMPANY",
>
> "smallest" : "'C' A CATERING AND EVENT COMPANY",
>
> "smallest" : "0",
>
> "smallest" : "(convenience store)",
>
> "smallest" : "All",
>
> "smallest" : " ",
>
> "smallest" : "alsip",
>
> "smallest" : "IL",
>
> "smallest" : "60007",
>
> "smallest" : "2010-01-04T00:00:00",
>
> "smallest" : "1315 license reinspection",
>
> "smallest" : "Business Not Located",
>
> "smallest" : "1. SOURCE SOUND CONDITION, NO SPOILAGE, FOODS
> PROPERLY LABELED, SHELLFISH TAGS IN PLACE - Comments: ",
>
> "smallest" : "41.64467013219805",
>
> "smallest" : "-87.91442843927047",
>
> "smallest" : {
>
> , [ 101045, "180563EF-F00C-41F9-A0C2-7B2100AD2DED", 101045, 1441883778,
> "386464", 1441883778, "386464", null, "343208", "SUBWAY # 26817-309 W.
> MONROE", "SUBWAY", "1519940", "Restaurant", "Risk 1 (High)", "309 W MONROE
> ST ", "CHICAGO", "IL", "60606", "2010-07-13T00:00:00", "Canvass", "Pass",
> "32. FOOD AND NON-FOOD CONTACT SURFACES PROPERLY DESIGNED, CONSTRUCTED AND
> MAINTAINED - Comments: All food and non-food contact equipment and utensils
> shall be smooth, easily cleanable, and durable, and shall be in good
> repair. A few of the smallest cutting boards, with deep/dark grooves must
> be sanded/bleached- to make smooth/easily cleanable. | 33. FOOD AND
> NON-FOOD CONTACT EQUIPMENT UTENSILS CLEAN, FREE OF ABRASIVE DETERGENTS -
> Comments: All food and non-food contact surfaces of equipment and all food
> storage utensils shall be thoroughly cleaned and sanitized daily. Interior
> of ice chute at soda dispensers need cleaning. | 34. FLOORS: CONSTRUCTED
> PER CODE, CLEANED, GOOD REPAIR, COVING INSTALLED, DUST-LESS CLEANING
> METHODS USED - Comments: The floors shall be constructed per code, be
> smooth and easily cleaned, and be kept clean and in good repair. Floors of
> walk-in cooler (around corners) need cleaning; baseboard at the outside
> base of the walk-in cooler door is in poor repair, must fix.",
> "41.88052826494154", "-87.63556938091433", [ null, "41.88052826494154",
> "-87.63556938091433", null, false ] ]
>
> Thanks,
> Khurram
>