[
https://issues.apache.org/jira/browse/HIVE-27590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
lvhu updated HIVE-27590:
------------------------
Description:
*Currently, the only way to set a custom line delimiter when creating a table in
Hive is to write a custom InputFormat, like this:*
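{code:java}
package abc.hive;

// Custom InputFormat that hard-codes a single record delimiter
public class MyFstTextInputFormat extends FileInputFormat<LongWritable, Text>
    implements JobConfigurable {
  ...
}

create table test (
  id string,
  name string
)
STORED AS INPUTFORMAT 'abc.hive.MyFstTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'; {code}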
If tables use several different record delimiters, a separate TextInputFormat
has to be written for each one.
Unfortunately, the ideal approach is not supported yet:
{code:java}
create table test (
id string,
name string
)
row format delimited fields terminated by '\t' -- supported
LINES TERMINATED BY '|@|' ; -- not supported {code}
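To make the intended semantics of LINES TERMINATED BY '|@|' concrete, here is a minimal plain-Java sketch of splitting raw file content into records on a multi-character delimiter. The class and method names (RecordSplitDemo, splitRecords) are hypothetical and only for illustration; this is not how Hive or Hadoop reads splits internally.

```java
import java.util.ArrayList;
import java.util.List;

final class RecordSplitDemo {
    private RecordSplitDemo() {}

    /** Splits data into records on a literal (non-regex) delimiter. */
    static List<String> splitRecords(String data, String delimiter) {
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> records = new ArrayList<>();
        int start = 0;
        int idx;
        while ((idx = data.indexOf(delimiter, start)) >= 0) {
            records.add(data.substring(start, idx));
            start = idx + delimiter.length();
        }
        // Keep a trailing record that is not followed by a delimiter
        if (start < data.length()) {
            records.add(data.substring(start));
        }
        return records;
    }

    public static void main(String[] args) {
        // Two rows with tab-separated fields, each row terminated by |@|
        String data = "1\talice|@|2\tbob|@|";
        for (String record : splitRecords(data, "|@|")) {
            System.out.println(record.replace("\t", " <TAB> "));
        }
    }
}
```

With such a delimiter, a newline inside a field no longer ends the record, which is the main reason to want a custom line terminator.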
I have a solution that supports setting the line delimiter when creating a
table, using the LINES TERMINATED BY clause as in the statement above.
*1. Create a new HiveTextInputFormat class to replace the TextInputFormat class.*
HiveTextInputFormat reads a <pathToDelimiter> file and chooses the record
delimiter for each input file based on the prefix of the file path.
{code:java}
public class HiveTextInputFormat extends FileInputFormat<LongWritable, Text>
    implements JobConfigurable {
  ....
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit genericSplit, JobConf job, Reporter reporter)
      throws IOException {
    reporter.setStatus(genericSplit.toString());
    // getPath() is defined on FileSplit, not on InputSplit
    FileSplit fileSplit = (FileSplit) genericSplit;
    // Default delimiter
    String delimiter = job.get("textinputformat.record.delimiter");
    // Path of the file being read
    String filePath = fileSplit.getPath().toUri().getPath();
    // Path-to-delimiter mapping, obtained by parsing the <pathToDelimiter> file
    Map<String, String> pathToDelimiterMap = parsePathToDelimiter();
    for (Map.Entry<String, String> entry : pathToDelimiterMap.entrySet()) {
      // Configured path
      String configPath = entry.getKey();
      // If configPath is a prefix of filePath, use the delimiter configured for it
      if (filePath.startsWith(configPath)) {
        delimiter = entry.getValue();
      }
    }
    byte[] recordDelimiterBytes = null;
    if (null != delimiter) {
      recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);
    }
    return new LineRecordReader(job, fileSplit, recordDelimiterBytes);
  }
} {code}
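One detail of the prefix lookup may deserve attention: a plain startsWith check also matches sibling directories (/warehouse/t1 is a string prefix of /warehouse/t10/part-0), and when several configured paths match, the result depends on map iteration order. The following plain-Java sketch shows one possible way to handle both, by matching on directory boundaries and preferring the longest configured path. PathDelimiterResolver and its methods are hypothetical names, not existing Hive classes, and the longest-prefix rule is an assumption rather than part of the proposal.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PathDelimiterResolver {
    private PathDelimiterResolver() {}

    /** True if path equals dir or lies somewhere below directory dir. */
    static boolean isUnder(String path, String dir) {
        String dirWithSlash = dir.endsWith("/") ? dir : dir + "/";
        return path.equals(dir) || path.startsWith(dirWithSlash);
    }

    /** Returns the delimiter of the longest matching configured path, or the default. */
    static String resolve(Map<String, String> pathToDelimiter,
                          String filePath, String defaultDelimiter) {
        String best = null;
        for (String configPath : pathToDelimiter.keySet()) {
            if (isUnder(filePath, configPath)
                    && (best == null || configPath.length() > best.length())) {
                best = configPath;
            }
        }
        return best == null ? defaultDelimiter : pathToDelimiter.get(best);
    }

    public static void main(String[] args) {
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("/warehouse/t1", "|@|");
        mapping.put("/warehouse/t1/archive", "##");
        // Most specific configured path wins
        System.out.println(resolve(mapping, "/warehouse/t1/archive/part-0", null));
        // Sibling directory t10 does not match t1, so the default is used
        System.out.println(resolve(mapping, "/warehouse/t10/part-0", "(default)"));
    }
}
```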
*2. Modify Hive's CREATE TABLE handling to support <LINES TERMINATED BY>.*
{code:java}
create table test (
id string,
name string
)
LINES TERMINATED BY '|@|'
LOCATION hdfs_path; {code}
When a user executes the SQL above, Hive will insert (hdfs_path, '|@|') into the
<pathToDelimiter> file.
HiveTextInputFormat will be set as the default INPUTFORMAT.
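The layout of the <pathToDelimiter> file is not specified above. As a rough sketch, assuming one entry per line in the form location<TAB>delimiter, writing and parsing entries could look like the following. The class PathToDelimiterFile, its methods, and the file layout are all assumptions made for illustration. A real implementation would also need to escape delimiters that contain a newline, and would read the file from HDFS rather than from an in-memory list.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PathToDelimiterFile {
    private PathToDelimiterFile() {}

    /** Formats one entry in the assumed layout: location, a tab, then the delimiter. */
    static String formatEntry(String location, String delimiter) {
        return location + "\t" + delimiter;
    }

    /** Parses entries, skipping blank lines, '#' comments and lines without a tab. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> mapping = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            int tab = line.indexOf('\t');
            if (tab <= 0) {
                continue; // malformed entry: no location or no separator
            }
            mapping.put(line.substring(0, tab), line.substring(tab + 1));
        }
        return mapping;
    }

    public static void main(String[] args) {
        // Entry that CREATE TABLE ... LINES TERMINATED BY '|@|' LOCATION ... might record
        String entry = formatEntry("/warehouse/test", "|@|");
        Map<String, String> mapping = parse(Arrays.asList("# path to delimiter", entry));
        System.out.println(mapping);
    }
}
```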
Looking forward to receiving your suggestions and feedback!
*If you accept my idea, I hope you can assign the task to me. My GitHub account
is: _lvhu-goodluck_*
I really hope to contribute code to the community.
> Make LINES TERMINATED BY work when creating table
> -------------------------------------------------
>
> Key: HIVE-27590
> URL: https://issues.apache.org/jira/browse/HIVE-27590
> Project: Hive
> Issue Type: Improvement
> Components: Hive, SQL
> Affects Versions: 3.1.3
> Environment: {code:java}
> //code placeholder
> {code}
> Reporter: lvhu
> Assignee: lvhu
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)