Support Big Data formats in CREATE TABLE EXTERNAL FILE
------------------------------------------------------

                 Key: CORE-5663
                 URL: http://tracker.firebirdsql.org/browse/CORE-5663
             Project: Firebird Core
          Issue Type: New Feature
          Components: Engine
         Environment: Big Data, Amazon AWS, Azure Cloud
            Reporter: Juarez Rudsatz
            Priority: Minor


With little effort, Firebird could be extended to cover many big data
processing scenarios.

Basically, big data processing is done in two ways:
- Batch: a program using a big data batch framework reads data from structured
storage sources, converts it to a programming representation such as
object/struct (properties) or dataset/dataframe (rows/cols), applies several
transformations such as map, reduce, join, group by and filter, and writes the
output to a new structured storage.
- Streaming: a program using a streaming framework reads data from realtime or
buffered sources and writes to other realtime/buffered destinations or to a
structured storage.

Batch frameworks commonly used are Hadoop, Spark, Pig and several others.
Streaming frameworks commonly used are Spark Streaming, Kafka, Amazon Kinesis,
Amazon Firehose, etc.
Structured sources can be database data accessed via JDBC, or files accessed
from network drives, Hadoop HDFS filesystems, AWS S3 filesystems or Azure
Storage filesystems.

Usually the processed data is consumed in one of these ways:
a) exported directly to a spreadsheet (CSV) in an ad-hoc manner
b) uploaded to a database or data warehouse/BI infrastructure
c) stored in a pre-summarized format in a structured source for further
processing or analysis

Tools used for analysis in scenario c), besides batch frameworks, are: Apache
Hive, Amazon Athena, Amazon Redshift Spectrum.
They basically provide a mechanism to query files stored in structured sources
like Amazon S3, using plain SQL or the Pig language.
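
As an illustration (this example is not part of the original request; table,
column and bucket names are invented), a Hive/Athena-style declaration that
makes a set of Parquet files on S3 queryable with plain SQL looks roughly like
this:

  -- Illustrative Hive/Athena-style DDL; names are invented.
  CREATE EXTERNAL TABLE sales (
      id       BIGINT,
      amount   DECIMAL(18,2),
      sold_at  TIMESTAMP
  )
  STORED AS PARQUET
  LOCATION 's3://my-bucket/sales/';

  -- Afterwards the files can be queried like an ordinary table:
  SELECT sold_at, SUM(amount) FROM sales GROUP BY sold_at;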

Firebird could take a slice of this market just by adding some basic support
for this workflow.

To perform well in this scenario, Firebird should:
1) have a very fast data injection/bulk insert, similar to the Amazon Redshift
COPY command (Redshift being a columnar PostgreSQL derivative); see the sketch
after this list
2) support the file formats commonly used in big data, such as: CSV/TSV, Avro,
Parquet, ORC, Grok, RCFile, RegexSerDe, SequenceFile
3) extend EXTERNAL FILE so these formats can be read from remote structured
sources like those cited above.
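
For reference, a minimal sketch of the Redshift COPY bulk load mentioned in
item 1 (table, bucket and role names are invented):

  -- Loads all gzipped CSV files under the given S3 prefix in parallel.
  COPY sales
  FROM 's3://my-bucket/sales/'
  IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-load'
  FORMAT AS CSV
  GZIP;

A bulk-insert path of comparable speed on the Firebird side is what item 1
asks for.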

This could be done by adding a FORMAT clause to the existing CREATE TABLE
EXTERNAL FILE command, as sketched below.
Most of these formats and filesystems have libraries which could be used to
speed up development.
Likewise, one could start with the most used formats (CSV/TSV, Parquet, Avro)
and the most used filesystems (AWS S3, Azure Storage).
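
Purely as a sketch of what such a declaration might look like (neither the
FORMAT clause nor remote file locations exist in Firebird today; all names are
invented):

  -- Hypothetical extension of CREATE TABLE ... EXTERNAL FILE:
  -- the FORMAT clause and the s3:// location are proposals, not existing syntax.
  CREATE TABLE sales EXTERNAL FILE 's3://my-bucket/sales/part-00000.parquet'
  FORMAT PARQUET
  (
      id       BIGINT,
      amount   DECIMAL(18,2),
      sold_at  TIMESTAMP
  );

  -- The table would then be queryable like any other Firebird table:
  SELECT COUNT(*) FROM sales;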

