[
https://issues.apache.org/jira/browse/FLINK-5280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15742507#comment-15742507
]
Ivan Mushketyk commented on FLINK-5280:
---------------------------------------
Hi [~fhueske] ,
Thank you for your comments. It's much clearer now, but either I am still
missing something obvious or the task is more involved than it was described.
Let me first describe how I understand this issue so that you can correct me.
The goal of this task is to support nested data structures. This means that
if we have a type definition like this:
{code:java}
class ParentPojo {
  ChildPojo child;
  int num;
}
class ChildPojo {
  String str;
}
{code}
and we have a *TableSource* that returns a dataset of *ParentPojo*, we can
access nested fields in SQL queries. Something like:
{code:sql}
SELECT * FROM pojos WHERE child.str LIKE '%Rubber%'
{code}
In this case *child.str* is a way to access a nested field.
The first thing that confuses me is that the current [SQL
grammar|https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/table_api.html#sql-syntax]
does not seem to support nested field access, but I think this may be a
relatively minor nuisance.
If I understand it correctly, *flink-table* internally converts any input into a
dataset of Rows and then performs operations on it. To convert a nested
*ParentPojo* into a flat schema, we can extract all leaf values into two columns:
{code}
child.str num
{code}
similarly to how *Parquet* identifies columns in nested types (see this
[slide|http://www.slideshare.net/julienledem/strata-london-2016-the-future-of-column-oriented-data-processing-with-arrow-and-parquet/10?src=clipshare]).
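To make the flattening concrete, here is a rough sketch in plain Java (no Flink classes) of how the leaf column names could be derived from the type above. The class name *LeafColumns*, the method *leafColumns*, and the rule for what counts as a leaf are assumptions made for this illustration only; this is not how flink-table extracts type information, and the fields are declared public here only to keep the example simple.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

public class LeafColumns {
    // Same shape as the example types above.
    static class ChildPojo { public String str; }
    static class ParentPojo { public ChildPojo child; public int num; }

    // Assumption for this sketch: primitives, boxed types and String are leaves.
    static boolean isLeaf(Class<?> type) {
        return type.isPrimitive()
            || type == String.class
            || type == Boolean.class
            || type == Character.class
            || Number.class.isAssignableFrom(type);
    }

    // Returns dotted paths to all leaf fields, e.g. "child.str".
    static List<String> leafColumns(Class<?> type) {
        List<String> out = new ArrayList<>();
        collect(type, "", out);
        return out;
    }

    // Depth-first walk over declared fields. Recursive types, arrays and
    // collections are not handled in this sketch.
    private static void collect(Class<?> type, String prefix, List<String> out) {
        for (Field f : type.getDeclaredFields()) {
            if (Modifier.isStatic(f.getModifiers()) || f.isSynthetic()) {
                continue;
            }
            String path = prefix.isEmpty() ? f.getName() : prefix + "." + f.getName();
            if (isLeaf(f.getType())) {
                out.add(path);
            } else {
                collect(f.getType(), path, out);
            }
        }
    }

    public static void main(String[] args) {
        // The order follows getDeclaredFields(), which the JDK does not guarantee.
        System.out.println(leafColumns(ParentPojo.class));
    }
}
```

For *ParentPojo* this yields the two paths *child.str* and *num*, which would become the column names of the flat schema.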
Now, this is where it becomes more interesting. If I understand it correctly,
*BatchScan#convertToExpectedType* is used to convert an input dataset into a
dataset of *Row*s. For this task it generates a mapper function in
*FlinkRel#getConversionMapper*, which then calls
*CodeGenerator#generateConverterResultExpression*.
So in our case it should generate code similar to:
{code:java}
public Row map(ParentPojo parent) {
  Row row = new Row(2);
  row.setField(0, parent.child.str);
  row.setField(1, parent.num);
  return row;
}
{code}
*CodeGenerator* accepts *fieldNames* and an optional POJO field mapping to
generate accessors. It seems that the main work is performed in
*CodeGenerator#generateFieldAccess*, which generates access code for a POJO
field with the corresponding field name, but it does not generate any code
that accesses nested fields.
Therefore, if I understand this correctly, we need to start by updating
*CodeGenerator* to generate nested accessors, and then we can extend
*TableSource* to support nested data.
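To illustrate what I mean by nested accessors, here is a minimal sketch in plain Java that turns a dotted column path into an access expression string. *NestedAccessSketch*, *generateNestedAccess* and *generateSetField* are hypothetical names for this example and are not part of *CodeGenerator*. The null guard on intermediate objects is my own addition, and direct field access assumes public fields; a real implementation would also have to deal with getters.

```java
import java.util.ArrayList;
import java.util.List;

public class NestedAccessSketch {
    // Hypothetical helper: builds a Java expression that reads the field at
    // `path` (e.g. "child.str") starting from the variable `inputTerm`.
    // Intermediate objects are null-checked; the leaf itself is not.
    static String generateNestedAccess(String inputTerm, String path) {
        String[] parts = path.split("\\.");
        String current = inputTerm;
        List<String> nullChecks = new ArrayList<>();
        for (int i = 0; i < parts.length; i++) {
            current = current + "." + parts[i];
            if (i < parts.length - 1) {
                nullChecks.add(current + " == null");
            }
        }
        if (nullChecks.isEmpty()) {
            return current; // top-level field, nothing to guard
        }
        return "(" + String.join(" || ", nullChecks) + " ? null : " + current + ")";
    }

    // Hypothetical helper: one statement of the generated map() body.
    static String generateSetField(int index, String inputTerm, String path) {
        return "row.setField(" + index + ", " + generateNestedAccess(inputTerm, path) + ");";
    }

    public static void main(String[] args) {
        String[] columns = {"child.str", "num"};
        for (int i = 0; i < columns.length; i++) {
            System.out.println(generateSetField(i, "parent", columns[i]));
        }
    }
}
```

For the columns *child.str* and *num* this produces one *row.setField* statement per column, where the first one guards against *parent.child* being null before reading *str*.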
Am I overthinking this issue? Or am I missing something obvious?
> Extend TableSource to support nested data
> -----------------------------------------
>
> Key: FLINK-5280
> URL: https://issues.apache.org/jira/browse/FLINK-5280
> Project: Flink
> Issue Type: Improvement
> Components: Table API & SQL
> Affects Versions: 1.2.0
> Reporter: Fabian Hueske
> Assignee: Ivan Mushketyk
>
> The {{TableSource}} interface does currently only support the definition of
> flat rows.
> However, there are several storage formats for nested data that should be
> supported such as Avro, Json, Parquet, and Orc. The Table API and SQL can
> also natively handle nested rows.
> The {{TableSource}} interface and the code to register table sources in
> Calcite's schema need to be extended to support nested data.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)