Selina,

I would use parquet-avro to create a writer. Kafka messages are commonly encoded as Avro, so you may already be working with Avro objects. If not, convert your messages to Avro records and then write them with AvroParquetWriter.
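As a minimal sketch of that conversion step, here is how a field-level payload could be turned into an Avro GenericRecord. The `Event` schema and its fields are hypothetical placeholders; you would substitute your real message schema:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class AvroConversion {
    // Hypothetical schema for illustration only; use your real message schema.
    static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\": \"record\", \"name\": \"Event\", \"fields\": ["
        + "{\"name\": \"id\", \"type\": \"long\"},"
        + "{\"name\": \"payload\", \"type\": \"string\"}]}");

    // Build a GenericRecord from fields parsed out of a Kafka message
    // (e.g. after JSON or CSV parsing).
    public static GenericRecord toAvro(long id, String payload) {
        GenericRecord record = new GenericData.Record(SCHEMA);
        record.put("id", id);
        record.put("payload", payload);
        return record;
    }
}
```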

You can create a writer that produces S3 files by setting up your S3 file system settings in a Configuration and then using paths that look like this: s3n://s3bucket-name/path/within/bucket. You would just pass that Path to the AvroParquetWriter.builder method, configure the builder, and call build() to get a configured writer.
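A sketch of those steps, assuming the s3n-style credential property names of that era (the method name `open`, the credential placeholders, and the chosen codec are illustrative, not prescribed):

```java
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class S3ParquetSink {
    // Build a writer for the target path: an s3n:// URI for S3, or a
    // local/HDFS path. The credential property names below are the ones
    // the s3n connector reads; adjust for your Hadoop version.
    public static ParquetWriter<GenericRecord> open(String target, Schema schema)
            throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");      // placeholder
        conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");  // placeholder

        return AvroParquetWriter.<GenericRecord>builder(new Path(target))
                .withSchema(schema)
                .withConf(conf)
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .build();
    }
}
```

You would then call write(record) once per incoming message and close() when the file is complete, since the Parquet footer is only written on close.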

rb

On 11/09/2015 04:55 PM, Selina Tech wrote:
Hi, Ryan:

         Thanks a lot for your suggestion.  I do not need to get an
output stream if I can write my continuous Kafka messages (in JSON,
CSV, or Avro format) to AWS S3 in Parquet format. Could you explain
this in a little more detail so that I can work out a solution?

          There is one solution: create a Parquet table in Hive with
Presto, use the Presto Hive connector to query the data and save it
to Hive, and then send the data to S3. I am wondering if there is a
better solution?

Sincerely,
Selina

On Mon, Nov 9, 2015 at 9:35 AM, Ryan Blue <[email protected]
<mailto:[email protected]>> wrote:

    Selina,

    You should be able to write to S3 without needing to flush to an
    output stream. You would just use the S3 FileSystem to write data
    instead of HDFS. This doesn't need to require Parquet to write to an
    OutputStream instead of a file. Is there a reason why you want to
    supply an output stream instead?

    rb


    On 11/05/2015 05:56 PM, Selina Tech wrote:

        Dear all:

                I am wondering if I could read an input stream such as
        Kafka, convert it to Parquet data, and write back to an output
        stream?  All the examples I found convert data files to
        Parquet.

                I know this feature was not available last year. How
        about right now?

                I am trying to aggregate Kafka messages with Samza,
        convert them to Parquet data, and then save them to S3. What
        is the best way to implement this?


        Sincerely,
        Selina

        reference:
        https://github.com/Parquet/parquet-mr/issues/231



    --
    Ryan Blue
    Software Engineer
    Cloudera, Inc.




--
Ryan Blue
Software Engineer
Cloudera, Inc.
