Paul Rogers created DRILL-5483:
----------------------------------

             Summary: Production-quality solution to define text file field 
types and widths
                 Key: DRILL-5483
                 URL: https://issues.apache.org/jira/browse/DRILL-5483
             Project: Apache Drill
          Issue Type: Improvement
    Affects Versions: 1.10.0
            Reporter: Paul Rogers


This bug is in response to the work done in DRILL-5419. In that PR, we 
essentially:

* Defined field widths in a CAST statement: CAST(columns[2] AS VARCHAR(10))
* Propagated the known size information up through the internal representation
to compute widths for each column in the result set.
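
For example, against a CSV file the workaround looks something like this (the
file path, column positions, and widths below are made up):

  -- Hypothetical CSV file: assign an explicit width to each text column
  SELECT
    CAST(columns[0] AS VARCHAR(30)) AS cust_name,
    CAST(columns[1] AS VARCHAR(10)) AS zip_code
  FROM dfs.`/data/customers.csv`;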

This is a wonderful start and a big improvement. Users can create views that 
contain the needed casts. The size information allows tools that need size 
information to work correctly. All is good.

However, if we start thinking about the user implications, we quickly realize 
that the above is just a very partial fix for real-world use in a data lake 
application.

* A data lake has many types of files with new ones added constantly.
* Drill is often used for data discovery: to see what is in each file.
* The number of files is typically huge: 100K, 1M, or more. (This is, after
all, Big Data.)
* New files, of existing types, arrive constantly. For example, web server logs 
might arrive every five minutes.

Let's consider how the DRILL-5419 fix would apply in this environment.

* If a user queries a file directly, without a CAST, Drill will have no column 
width information and will return a width of 64K (the maximum field width 
allowed by Drill) to the client, which will fail due to over-sized buffers.
* The user must repeat the query, but assign a width to each column. Since this
is data discovery, the user does not know the widths.
* So, the user must run a query to compute the maximum width by scanning all
the data. Write down the answers. Use these in a CAST in each subsequent query.
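
That discovery step is a full-scan aggregation along these lines (again, the
names and path are made up):

  -- Full scan just to learn the widths to use in later CASTs
  SELECT
    MAX(CHAR_LENGTH(columns[0])) AS cust_name_width,
    MAX(CHAR_LENGTH(columns[1])) AS zip_code_width
  FROM dfs.`/data/customers.csv`;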

Now, the above can be simplified. Once we know the widths:

* Create a view for the file(s). This requires that the analytic user have 
write access to the file system and use a tool other than Drill to create the 
view.
* As new files arrive, rerun the max-length query to check for new lengths. If
the lengths have grown, manually update the views.
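
Whatever tool writes it, the end result is equivalent to a Drill view
definition along these lines (the workspace, view name, and widths are
invented; the view must live in a writable workspace such as dfs.tmp):

  -- Re-run (by hand or by script) whenever the observed widths grow
  CREATE OR REPLACE VIEW dfs.tmp.`customers_v` AS
  SELECT
    CAST(columns[0] AS VARCHAR(30)) AS cust_name,
    CAST(columns[1] AS VARCHAR(10)) AS zip_code
  FROM dfs.`/data/customers.csv`;

Each time the max-length query reports a larger width, someone, or some script,
must re-issue this statement with the new numbers.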

For the above to work, views must be created for each and every file (or users
must share expected widths somehow and write the CAST statement into queries
for files for which views are not defined).

But, this is a big data system, so there are millions of files. So, work out a 
way to create views for all these files. Perhaps create scripts that scan all 
new files and contain code to revise the views.

But, this is a multi-user system, so the users must agree on who will do the
full-table scan to compute the widths. For ad-hoc use, they must set up a Wiki,
e-mail list, or other means to share the widths to use in casts (with the
information eventually going into views).

If done by script, then the scripts have to handle race conditions that occur
when replacing views while users may be trying to access them. Done wrong,
users will get occasional failures due to missing or partial view definitions
as they try to read the file while the script (or a human) is updating the
views.

Views require different names in the query than the table. So, users must know
when to use the actual file name (with manually-tracked field widths) and when
to use view names. That information must be published somewhere so users can
consult it. Simply looking at available files is not enough; the user must know
that file a/b/c/d.csv must be queried via a/b/c/d-view.drill. Train the users
on these rules.

Views must be stored somewhere. Putting them with the data raises permission
and directory time-stamp issues. (Adding a view changes the directory
time-stamp, but that time-stamp is often used to detect new files.) Putting
them in some other location requires knowing the file-to-view location mapping.
Either solution is a major cost.

The full table scans to compute field lengths duplicate work to be done by the 
proposed statistics system. (The stats collection will compute other values, 
but not maximum field width.) This doubles the load on the system.

Drill CAST operations are based on the idea that, if the user says the field
should be 20 characters, then Drill should go ahead and truncate the rest. But
the use case here is that we want the actual field width; the CAST is just a
work-around. Truncating the data can mean data loss. (Truncating "12345678" to
"12345", because that's what we expect, changes the meaning of the number and
should be considered data corruption.)

The Drill CAST operator works by making a copy. So, we greatly degrade
performance by making data copies when all we really want to do is specify a
width. Thus, the trade-off is to either overload the client tool or slow query
performance.

Drill must scan the views prior to each query. To improve performance, we'd 
want to cache the views. But, each Drillbit is independent of the others, so 
each would cache its own copy. With millions of views, the memory consumed by 
this caching will become an issue.

Looking at the big picture, Drill is designed for schema-free data exploration.
That goal is hard to achieve if, to get schema-free access, the user must first
impose a schema (via CASTs, optionally in views). It is also hard to achieve if
data exploration requires a pre-exploration pass to determine field widths.

One can go on.

The point is, for field widths to work, Drill should:

* Sample data to guess widths to avoid user hassle during data exploration. 
Drill reads ~64K records per batch. Simply reading the first batch, and 
buffering it, would give us, say, 99% accuracy of field widths.
* Allow an optional "schema hints" file to be associated with a type of file. 
The hints file can specify preferred column names, field widths and data types.
* Provide a transparent mapping from table name to schema hints so that the 
user always explores the file, with the schema hints being found when available.
* Impose the schema at read time. That is, rather than read data into one 
column, then copy into another (via a CAST) to specify width, or set the field 
as a number, do the translation at read time for optimal performance and 
minimal memory impact.
* Tie creation and update of the schema hints file with stats collection. Since 
stats scans the entire file, allow it to compute maximum field width, validate 
that the data does, in fact, fit the preferred type, etc. 
* Parquet metadata use and creation is currently subject to the kind of race 
conditions described above. Solve the problem for Parquet, then use the same 
solution for the schema hints files.
* Perhaps bite the bullet, learn from Hive and Impala, and provide a 
cluster-wide metadata store that can manage and cache the Parquet metadata, 
text file schemas and statistics data.

The result will be that users get accurate column names, types, and widths
without having to use CAST statements in each query or manually create views
for every file.

The above is obviously not a complete design. Rather, it is a stake in the
ground for how to provide the best experience to users beyond the simple
work-around provided by DRILL-5419.


