Support Big Data formats in CREATE TABLE EXTERNAL FILE
------------------------------------------------------
                 Key: CORE-5663
                 URL: http://tracker.firebirdsql.org/browse/CORE-5663
             Project: Firebird Core
          Issue Type: New Feature
          Components: Engine
         Environment: Big Data, Amazon AWS, Azure Cloud
            Reporter: Juarez Rudsatz
            Priority: Minor

With little effort, Firebird could be extended to cover many big data processing scenarios. Basically, big data processing is done in two ways:

- Batch: a program using a big data batch framework reads data from structured storage sources, converts it to a programming format such as object/struct (properties) or dataset/dataframe (rows/cols), applies several transformations like map, reduce, join, group by and filter, and writes the output to a new structured storage.
- Streaming: a program using a streaming framework reads data from realtime or buffered sources and writes to other realtime/buffered destinations, or to a structured storage.

Batch frameworks commonly used are Hadoop, Spark, Pig and several others. Streaming frameworks commonly used are Spark Streaming, Kafka, Amazon Kinesis, Amazon Firehose, etc. Structured sources can be database data accessed via JDBC, or files accessed from network drives, Hadoop HDFS filesystems, AWS S3 filesystems or Azure Storage filesystems.

Usually the processed data is consumed by:

a) directly exporting it to a spreadsheet (CSV) in an ad hoc manner
b) uploading it to a database or data warehouse/BI infrastructure
c) storing it in a pre-summarized format in a structured source for further processing or analysis

Tools used for analysis in scenario c), besides the batch frameworks, are Apache Hive, Amazon Athena and Amazon Redshift Spectrum. They basically provide a mechanism to query files stored in structured sources such as Amazon S3, using plain SQL or Pig languages.

Firebird could take a slice of this market just by adding some basic support for this workflow. To perform well in this scenario, Firebird should:

1) have a very fast data injection/bulk insert, like the Amazon Redshift COPY command (Redshift being a columnar PostgreSQL derivative); a sketch of that command follows this list
2) support the file formats commonly used in big data, such as CSV/TSV, Avro, Parquet, ORC, Grok, RCFile, RegexSerDe and SequenceFile
3) extend EXTERNAL FILE to read these formats from remote structured sources like those cited above; see the DDL sketch after this list

This can be done by adding a FORMAT clause to the existing CREATE TABLE ... EXTERNAL FILE command. Most of these formats and filesystems have libraries which can be used to speed up development. Likewise, one could start with the most used formats (CSV/TSV, Parquet, Avro) and the most used filesystems (AWS S3, Azure Storage).
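For reference, the Redshift bulk load mentioned in point 1 looks roughly like this (the table name, bucket path and IAM role ARN are made-up placeholders):

    COPY sales
    FROM 's3://mybucket/sales/2017/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    CSV;

And here is a minimal sketch of how the proposal in point 3 could read in Firebird DDL. Today's EXTERNAL FILE accepts only a local path (subject to ExternalFileAccess in firebird.conf) with fixed-length records; the FORMAT clause and the s3:// URI below are assumptions for discussion, not implemented syntax:

    /* Existing syntax: fixed-length records in a local file. */
    CREATE TABLE sales_ext EXTERNAL FILE '/data/sales.dat' (
        sale_date DATE,
        amount    DECIMAL(18,2)
    );

    /* Hypothetical extension: a remote source plus a FORMAT
       clause naming one of the supported file formats. */
    CREATE TABLE sales_s3 EXTERNAL FILE 's3://mybucket/sales/2017/part-00000.parquet'
        FORMAT PARQUET (
        sale_date DATE,
        amount    DECIMAL(18,2)
    );

A FORMAT clause would keep the change additive: existing external tables behave exactly as before, and each new format could be implemented behind the same interface, delegating parsing to the existing format libraries mentioned above.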