[ 
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305661#comment-16305661
 ] 

Paul Rogers commented on DRILL-6035:
------------------------------------

h4. Record and Batch Sizes

Prior to Drill 1.13, the JSON reader always read 4096 records per batch. If 
the input records are small (just an integer, say), then the batches are 
very small and Drill may not run efficiently. For example, the following 
leads to batches of just 32K bytes:

{code}
{a: 10} {a: 20} ...
{code}

However, if the incoming records are large (perhaps they contain the entire 
history of a support issue or forum post), then batches can become far too 
large. If a record contains, say, an array of 1000 strings of 1K bytes each:

{code}
{comments: ["first...", "second...", ... ]}
{code}

Then the total batch size is 4K * 1000 * 1K = ~4GB, which is far too large for 
Drill to process.
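
The arithmetic behind both examples can be sketched as follows. This is only a back-of-the-envelope estimate, not Drill code: the function name {{estimate_batch_bytes}} is made up for illustration, and the 8 bytes per integer value is an assumption chosen to match the 32K figure above (it ignores per-vector overhead).

```python
RECORDS_PER_BATCH = 4096  # fixed record count used by the JSON reader before Drill 1.13


def estimate_batch_bytes(bytes_per_record, records=RECORDS_PER_BATCH):
    """Rough batch size: record count times average record width."""
    return records * bytes_per_record


# Small records: a single integer column, assumed to take 8 bytes per value.
small = estimate_batch_bytes(8)

# Large records: an array of 1000 strings of 1K bytes each.
large = estimate_batch_bytes(1000 * 1024)

print(small)  # 32768 bytes, i.e. 32K
print(large)  # 4194304000 bytes, i.e. roughly 4GB
```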

In Drill 1.12, the number of records per batch is fixed. The only thing the 
user can control is the design of their JSON, so that it fits within Drill's 
fixed record count. A good rule of thumb is to limit each record (including 
the total size of any arrays within the record) to around 10K bytes, resulting 
in a roughly 40 MB batch (4096 records * 10K bytes). (Even this may be too 
large for the sort to handle on some configurations. On systems with many 
processors, the sort often receives only 30 or so MB, and so wants batch sizes 
in the range of about 10 MB. This is not ideal; it just happens to be the way 
the code works.)
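
To turn a batch-size budget into a per-record limit, divide the budget by the fixed record count. The sketch below does this for the two budgets mentioned above. The helper name {{max_record_bytes}} is hypothetical, and the result is an average width per record, not a hard limit that Drill enforces.

```python
RECORDS_PER_BATCH = 4096  # fixed record count in Drill 1.12
MB = 1024 * 1024


def max_record_bytes(batch_budget_bytes, records=RECORDS_PER_BATCH):
    """Largest average record width that keeps a batch within the budget."""
    return batch_budget_bytes // records


print(max_record_bytes(40 * MB))  # 10240 bytes per record for a 40 MB batch
print(max_record_bytes(10 * MB))  # 2560 bytes per record for a 10 MB batch
```

So a sort that wants batches of about 10 MB implies records of roughly 2.5K bytes or less.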

In Drill 1.13, Drill will determine the number of records to read per batch to 
keep the batch size within a target size that Drill defines.
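
The idea of size-aware batching can be illustrated with a small sketch. This is not Drill's implementation: the function {{split_into_batches}}, the 16 MB target, and the choice to keep 4096 as an upper bound on the record count are all assumptions made for the example, since the target size Drill uses is not stated here.

```python
MB = 1024 * 1024


def split_into_batches(record_sizes, target_bytes, max_records=4096):
    """Group record sizes into batches whose total stays within target_bytes.

    A single record larger than target_bytes still gets a batch of its own.
    """
    batches, current, current_bytes = [], [], 0
    for size in record_sizes:
        full = len(current) >= max_records
        overflow = len(current) > 0 and current_bytes + size > target_bytes
        if full or overflow:
            # Close the current batch before adding this record.
            batches.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        batches.append(current)
    return batches


# Tiny records: the record-count cap decides where batches end.
tiny = split_into_batches([8] * 10000, target_bytes=16 * MB)
print([len(b) for b in tiny])  # [4096, 4096, 1808]

# Large records (1000 strings of 1K bytes each): the size target decides.
big = split_into_batches([1000 * 1024] * 40, target_bytes=16 * MB)
print([len(b) for b in big])  # [16, 16, 8]
```

With small records the batch still ends at the record-count cap, while with large records the batch ends after a handful of records instead of growing to gigabytes.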

> Specify Drill's JSON behavior
> -----------------------------
>
>                 Key: DRILL-6035
>                 URL: https://issues.apache.org/jira/browse/DRILL-6035
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.13.0
>            Reporter: Paul Rogers
>            Assignee: Pritesh Maker
>
> Drill supports JSON as its native data format. However, experience suggests 
> that Drill may have limitations in the JSON that Drill supports. This ticket 
> asks to clarify Drill's expected behavior on various kinds of JSON.
> Topics to be addressed:
> * Relational vs. non-relational structures
> * JSON structures used in practice and how they map to Drill
> * Support for varying data types
> * Support for missing values, especially across files
> These topics are complex, hence the request to provide a detailed 
> specification that clarifies what Drill does and does not support (or what 
> it should and should not support).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
