[
https://issues.apache.org/jira/browse/SQOOP-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226919#comment-14226919
]
Jarek Jarcec Cecho commented on SQOOP-1811:
-------------------------------------------
Yeah, I do see your point [~vybs]. Let's rename the {{getTextData()}} method to
{{getCSVTextData()}}, but let's add a very explicit comment that the CSV format
is very strict. I would even love a link to documentation describing the CSV
IDF format. I know that we don't have it right now, so we can add the link in
a subsequent JIRA (once we have the doc).
For the second part of the proposal, I still have doubts that we can define
{{get*TextData()}} as final. As an IDF can have an arbitrary internal
representation, the author needs to provide a way to convert the internal
structures to text, right? Probably using the utilities that you're
introducing in SQOOP-1813.
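To make the concern concrete, here is a minimal sketch of an IDF whose internal representation is not a String. The names {{SketchIDF}} and {{ListSketchIDF}} are hypothetical stand-ins, not the real Sqoop classes, and the join/split logic is deliberately naive (no quoting or escaping), so it does not follow the actual CSV IDF rules.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-in for an IDF base class with generic type T.
abstract class SketchIDF<T> {
    protected T data;

    public T getData() { return data; }
    public void setData(T data) { this.data = data; }

    // Left abstract (not final): only the implementation knows how to
    // convert its internal structure T to and from CSV text.
    public abstract String getCSVTextData();
    public abstract void setCSVTextData(String csv);
}

// Hypothetical IDF that keeps each row as a list of fields.
class ListSketchIDF extends SketchIDF<List<String>> {
    @Override
    public String getCSVTextData() {
        // Naive join; a real implementation would apply the strict CSV rules.
        return String.join(",", data);
    }

    @Override
    public void setCSVTextData(String csv) {
        // Naive split; the -1 limit keeps trailing empty fields.
        data = new ArrayList<>(Arrays.asList(csv.split(",", -1)));
    }
}
```

If the text accessors were final in the base class and backed by a String field, an implementation like {{ListSketchIDF}} would have no hook to keep that field in sync with its own structure.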
The idea behind having all three methods - {{getData()}},
{{get*TextData()}} and {{getObjectData()}} - as part of the public interface is
to avoid unnecessary conversions on very common data paths. I know that not all
the paths are in the code yet, so let me try to give a few examples:
* If the source and destination IDFs are the same, then we can exchange data
using {{getData()}} and {{setData()}} without doing any conversions
whatsoever.
* If the source and destination IDFs differ, or if we want to do a
transformation later, we also need the usual "row" view of the data, and
hence we have {{getObjectData()}} and {{setObjectData()}}.
* Based on our history with specialized connectors in Sqoop 1, the
majority of the "fast connectors" get data in textual format (e.g.
{{mysqldump}}, {{pg_dump}}, Netezza external tables). A very common
destination on HDFS is text as well (we would prefer Avro and Parquet, but
not all customer apps work with those formats yet). Hence we wanted to have a
path that optimizes for this kind of transfer as well. Rather than converting
the entire row into short-lived objects using {{getData()}}, we can do a
little processing on the string (substitute separators, ...) and get the
output of {{mysqldump}} and others into a text format that we can pass to
another IDF and eventually to the target destination.
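The three paths above could be sketched roughly as follows. Everything here is hypothetical: {{SketchFormat}}, {{TextSketchFormat}}, {{TransferSketch}} and the {{choosePath}} rule are illustration-only names and assumptions, not existing Sqoop code, and the comma handling ignores the real CSV IDF quoting rules.

```java
// Hypothetical interface exposing the three views of a row discussed above.
interface SketchFormat<T> {
    T getData();
    void setData(T data);
    String getCSVTextData();
    void setCSVTextData(String csv);
    Object[] getObjectData();
    void setObjectData(Object[] row);
}

// Hypothetical IDF that stores the row natively as one CSV string.
class TextSketchFormat implements SketchFormat<String> {
    private String text;

    public String getData() { return text; }
    public void setData(String data) { this.text = data; }
    public String getCSVTextData() { return text; }
    public void setCSVTextData(String csv) { this.text = csv; }

    public Object[] getObjectData() {
        // Naive split into fields; allocates one object per field.
        return text.split(",", -1);
    }

    public void setObjectData(Object[] row) {
        String[] parts = new String[row.length];
        for (int i = 0; i < row.length; i++) {
            parts[i] = String.valueOf(row[i]);
        }
        this.text = String.join(",", parts);
    }
}

final class TransferSketch {
    enum Path { NATIVE, OBJECTS, TEXT }

    // Assumed selection rule, for illustration only.
    static Path choosePath(Class<?> from, Class<?> to,
                           boolean needsTransformation, boolean bothEndsTextual) {
        if (needsTransformation) return Path.OBJECTS; // row view required
        if (from == to) return Path.NATIVE;           // same IDF: no conversion
        if (bothEndsTextual) return Path.TEXT;        // string-level processing only
        return Path.OBJECTS;                          // general fallback
    }

    // Path 1: same IDF on both sides, hand over the native data untouched.
    static <T> void copyNative(SketchFormat<T> from, SketchFormat<T> to) {
        to.setData(from.getData());
    }

    // Path 2: different IDFs or a transformation step, go through the row view.
    static void copyAsObjects(SketchFormat<?> from, SketchFormat<?> to) {
        to.setObjectData(from.getObjectData());
    }

    // Path 3: text in, text out, no per-field objects are created here.
    static void copyAsText(SketchFormat<?> from, SketchFormat<?> to) {
        to.setCSVTextData(from.getCSVTextData());
    }

    public static void main(String[] args) {
        TextSketchFormat source = new TextSketchFormat();
        TextSketchFormat target = new TextSketchFormat();
        source.setCSVTextData("1,alice");
        copyAsText(source, target);
        System.out.println(target.getCSVTextData());
    }
}
```

The point of keeping all three accessors public is that the framework can pick whichever copy method avoids work for a given source and destination pair.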
> IDF API changes
> ---------------
>
> Key: SQOOP-1811
> URL: https://issues.apache.org/jira/browse/SQOOP-1811
> Project: Sqoop
> Issue Type: Sub-task
> Components: sqoop2-framework
> Reporter: Veena Basavaraj
> Fix For: 1.99.5
>
>
> 1. Update the Javadocs for the IDF APIs.
> 2. Make getTextData final and call the pair getCSV and setCSV, so it is
> obvious that we want to enforce the CSV format.
> The following code can move to the base class IntermediateDataFormat and be
> made final, so there is no way to override it and we can enforce that all
> implementations return String instead of the generic T.
> {code}
> // hold the string in the IDF base class (not final, since the setter assigns it)
> private String text;
>
> public final String getCSVTextData() {
> return text;
> }
>
> public final void setCSVTextData(String text) {
> this.text = text;
> }
> {code}
> There is code in the CSVIDF implementation that has the rules for CSV parsing;
> it can be pulled out into CSV utils so that the connectors can use it.
> The T in CSV happens to be String, which is just a coincidence. If I write a
> new IDF implementation, T can be a custom object that encapsulates the whole
> row.
> 3. getData and setData can have custom implementations, so they can be
> overridden to return the generic type T.