Excellent — I will review.

Julian


> On May 7, 2016, at 3:03 AM, Toivo Adams <[email protected]> wrote:
> 
> Hi,
> 
> The first NiFi processor that uses the modified CSV adapter is done.
> https://issues.apache.org/jira/browse/NIFI-1280
> 
> The modified CSV adapter is included in the NiFi code; hopefully this is
> not a problem.
> 
> Any feedback is highly welcome.
> 
> Thanks
> Toivo
> 
> 
> 2016-05-05 23:26 GMT+03:00 Toivo Adams <[email protected]>:
> 
>> Hi,
>> 
>> This is a work in progress.
>> Far from ready and done.
>> 
>> I will try to create the first NiFi processor using the current modified
>> Calcite CSV adapter.
>> Hopefully the NiFi community will accept the idea of using SQL to specify
>> how CSV (or any other data format) data should be transformed.
>> 
>> The next step would be to implement a proper Calcite adapter.
>> As you said, the current modified CSV adapter fakes Calcite into believing
>> that the InputStream is the Orders *table*.
>> 
>> It seems the current modified CSV adapter is not very useful and not
>> something we can consider an official Calcite adapter; it is only a
>> prototype.
>> 
>> I will let you know when I have finished the NiFi processor.
>> Hopefully the first version will be ready within a few days.
>> I would be very grateful for any feedback.
>> 
>> I don't have any plans to use other platforms (Samza, Flink, Storm,
>> etc.); NiFi itself should cover everything needed.
>> 
>> Of course, to support advanced features a much better Calcite adapter
>> implementation is needed.
>> And after CSV, the next format is probably Avro.
>> 
>> I am not sure I am able to implement a correct Calcite streaming adapter,
>> because of my limited knowledge.
>> But if no one else has time to do it, I may try.
>> 
>> Finally, I used 'stream' and 'streaming' very loosely, and this was
>> confusing.
>> Sorry.
>> 
>> Thanks
>> toivo
>> 
>> 2016-05-05 22:47 GMT+03:00 Julian Hyde <[email protected]>:
>> 
>>> It sounds to me as if you will be able to get something half-baked on
>>> InputStream working easily, but it won't support parallelism,
>>> distribution, reliability, fail-over, and replay.
>>> 
>>> I don't have a problem with that. You can write your application now
>>> and re-deploy on a more robust streaming SQL implementation (Samza,
>>> Flink, Storm, etc.) later.
>>> 
>>> But be sure that you are using the "stream" keyword in your queries, e.g.
>>> 
>>>  select stream * from Orders where units > 10
>>> 
>>> I have a feeling that you have faked out Calcite to believe that your
>>> InputStream is the Orders *table*, whereas you should be making
>>> Calcite believe that your InputStream is the Orders *stream*, so that
>>> if you ever were to write
>>> 
>>>  select count(*) from Orders
>>> 
>>> Calcite would be able to go somewhere else for the historical table.
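
A stdlib-only Java sketch of the table-vs-stream duality described above: the historical rows that `select count(*) from Orders` should see versus the future rows that `select stream * from Orders` should see. The class and method names are illustrative, not Calcite's real API; Calcite's actual hook is the StreamableTable interface mentioned later in this thread.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class OrdersDuality {
  /** Historical rows: what "select count(*) from Orders" should see. */
  static final List<String> HISTORY = new ArrayList<>();
  /** Future rows: what "select stream * from Orders" should see. */
  static final BlockingQueue<String> LIVE = new LinkedBlockingQueue<>();

  /** Analogous to scanning the Orders *table*. */
  static int countHistorical() {
    return HISTORY.size();
  }

  /** Analogous to StreamableTable.stream(): rows arriving from now on.
   *  A real stream would block waiting for input instead of ending. */
  static Iterator<String> stream() {
    return LIVE.iterator();
  }

  public static void main(String[] args) {
    HISTORY.add("order-1,5");
    HISTORY.add("order-2,12");
    LIVE.add("order-3,20"); // arrives after the query starts
    System.out.println(countHistorical()); // table view: 2
    System.out.println(stream().next());   // stream view: order-3,20
  }
}
```

The point of keeping the two views separate is exactly Julian's: a planner can answer the historical query from "somewhere else" while the stream query follows new records.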
>>> 
>>> Julian
>>> 
>>> 
>>> On Wed, May 4, 2016 at 10:44 AM, Toivo Adams <[email protected]>
>>> wrote:
>>>> It's
>>>> java.io.InputStream
>>>> 
>>>> Actually it was not very difficult to modify the CSV adapter to accept
>>>> a stream instead of a file.
>>>> But of course, the current POC may have some problems that have not
>>>> shown up yet.
>>>> 
>>>> CSVReader accepts a stream reader as well.
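
For context, the core of the POC idea can be approximated with a stdlib-only enumerator that pulls CSV rows from an InputStream one line at a time, so the whole stream never has to fit in memory. This is only a sketch: the real POC presumably wraps CSVReader (which accepts a Reader) inside Calcite's Enumerator interface, and the naive split below ignores CSV quoting rules.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;

public class StreamCsvEnumerator implements Iterator<String[]> {
  private final BufferedReader reader;
  private String nextLine;

  public StreamCsvEnumerator(InputStream in) {
    this.reader = new BufferedReader(
        new InputStreamReader(in, StandardCharsets.UTF_8));
    advance();
  }

  private void advance() {
    try {
      // One line at a time: the source may be an endless stream.
      nextLine = reader.readLine();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  @Override public boolean hasNext() {
    return nextLine != null;
  }

  @Override public String[] next() {
    // Naive split; real CSV parsing needs quoting/escaping rules.
    String[] row = nextLine.split(",");
    advance();
    return row;
  }

  public static void main(String[] args) {
    InputStream in = new ByteArrayInputStream(
        "order-1,5\norder-2,12\n".getBytes(StandardCharsets.UTF_8));
    for (Iterator<String[]> it = new StreamCsvEnumerator(in); it.hasNext(); ) {
      System.out.println(String.join("|", it.next()));
    }
  }
}
```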
>>>> 
>>>> 
>>>> 
>>>> 2016-05-04 20:28 GMT+03:00 Julian Hyde <[email protected]>:
>>>> 
>>>>> Can you clarify what you mean by Stream? Do you mean
>>>>> java.util.Stream<String>?
>>>>> 
>>>>> The CSV adapter is first and foremost a file adapter. It might be
>>>>> easier to create a stream adapter and make it parse CSV than to start
>>>>> with a file adapter and make it handle streams.
>>>>> 
>>>>>> On May 4, 2016, at 10:07 AM, Toivo Adams <[email protected]>
>>>>>> wrote:
>>>>>> 
>>>>>> Julian,
>>>>>> 
>>>>>> Thank you for your suggestions.
>>>>>> 
>>>>>> I don't want to monitor a file and read appended records.
>>>>>> Initially I want to read from an in-memory stream.
>>>>>> Such a stream can be very, very large and may not fit in memory.
>>>>>> 
>>>>>> My idea is to create a NiFi processor which uses SQL for data
>>>>>> manipulation.
>>>>>> https://nifi.apache.org/
>>>>>> 
>>>>>> NiFi already contains a large set of processors which filter, split,
>>>>>> route, etc. different kinds of data.
>>>>>> Data can be CSV, JSON, Avro, whatever.
>>>>>> 
>>>>>> Different processors use different parameters to specify how data
>>>>>> should be filtered, split, routed, etc.
>>>>>> I think it would be nice to be able to use an SQL statement to
>>>>>> specify how data should be filtered, split, etc.
>>>>>> Because NiFi is able to handle very big data sets (called FlowFiles
>>>>>> in NiFi), streaming is a must.
>>>>>> 
>>>>>> I created a very simple POC of how to use a Stream instead of a File.
>>>>>> I just created new modified versions of
>>>>>> src/main/java/org/apache/calcite/adapter/csv/CsvEnumerator2.java
>>>>>> src/main/java/org/apache/calcite/adapter/csv/CsvSchema2.java
>>>>>> src/main/java/org/apache/calcite/adapter/csv/CsvSchemaFactory2.java
>>>>>> src/main/java/org/apache/calcite/adapter/csv/CsvTableScan2.java
>>>>>> 
>>>>>> src/main/java/org/apache/calcite/adapter/csv/CsvTranslatableTable2.java
>>>>>> 
>>>>>> CALCITE-1227 describes a slightly different use case.
>>>>>> 
>>>>>> I am ready to contribute back, but my Calcite knowledge is very
>>>>>> limited.
>>>>>> So the current POC is more like a hack than good code.
>>>>>> Should I upload my current POC files to CALCITE-1227,
>>>>>> or is it better to create another issue?
>>>>>> 
>>>>>> Thanks
>>>>>> toivo
>>>>>> 
>>>>>> 
>>>>>> 2016-05-04 19:02 GMT+03:00 Julian Hyde <[email protected]>:
>>>>>> 
>>>>>>> I’ve logged https://issues.apache.org/jira/browse/CALCITE-1227.
>>>>>>> Feel free to start implementing it!
>>>>>>> 
>>>>>>>> On May 4, 2016, at 8:56 AM, Julian Hyde <[email protected]> wrote:
>>>>>>>> 
>>>>>>>> It’s not straightforward to re-use a table adapter as a stream
>>>>>>>> adapter. The reason is that one query might want to see the past
>>>>>>>> (the current contents of the table) and another query might want to
>>>>>>>> see the future (the stream of records added from this point on).
>>>>>>>> 
>>>>>>>> I’m guessing that you want something like the CSV adapter that
>>>>>>>> watches a file and reports records added to the end of the file
>>>>>>>> (like the tail command [1]).
>>>>>>>> 
>>>>>>>> You’d have to change CsvTable to implement StreamableTable, and
>>>>>>>> implement the ‘Table stream()’ method to return a variant of the
>>>>>>>> table that is in “follow” mode.
>>>>>>>> 
>>>>>>>> It would probably be implemented by a variant of CsvEnumerator,
>>>>>>>> but one that gets its input in bursts, as the file is appended to.
>>>>>>>> 
>>>>>>>> Hope that helps.
>>>>>>>> 
>>>>>>>> Julian
>>>>>>>> 
>>>>>>>> [1] https://en.wikipedia.org/wiki/Tail_(Unix)
>>>>>>>> 
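
A stdlib-only sketch of the “follow” mode Julian describes: keep reading a file past end-of-file, polling for appended records, like `tail -f`. A real StreamableTable implementation would wrap something like this in a Calcite enumerator; all names here are illustrative.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class TailFollowReader {
  /** Returns the next line, polling past EOF until one appears or
   *  maxPolls attempts are exhausted (returns null on give-up). */
  static String nextLine(BufferedReader reader, int maxPolls)
      throws IOException, InterruptedException {
    for (int i = 0; i < maxPolls; i++) {
      String line = reader.readLine();
      if (line != null) {
        return line;        // a burst of appended input arrived
      }
      Thread.sleep(100);    // EOF for now; wait for the writer
    }
    return null;
  }

  public static void main(String[] args) throws Exception {
    Path file = Files.createTempFile("orders", ".csv");
    Files.writeString(file, "order-1,5\n");
    try (BufferedReader reader = Files.newBufferedReader(file)) {
      System.out.println(nextLine(reader, 5)); // order-1,5
      Files.writeString(file, "order-2,12\n", StandardOpenOption.APPEND);
      System.out.println(nextLine(reader, 5)); // order-2,12
    }
  }
}
```

Polling readLine() works for a regular file because a read at EOF simply returns nothing until the file grows; a production version would also handle partially written lines and file rotation.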
>>>>>>>>> On May 2, 2016, at 3:15 AM, Toivo Adams <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>> 
>>>>>>>>> One possibility is to modify CsvEnumerator
>>>>>>>>> Opinions?
>>>>>>>>> 
>>>>>>>>> Thanks
>>>>>>>>> Toivo
>>>>>>>>> 
>>>>>>>>> 2016-05-01 18:35 GMT+03:00 Toivo Adams <[email protected]>:
>>>>>>>>> 
>>>>>>>>>> Hi,
>>>>>>>>>> 
>>>>>>>>>> Please help newbie.
>>>>>>>>>> The CSV adapter works well for reading files, but I want to read
>>>>>>>>>> data from a stream.
>>>>>>>>>> The data is not fixed length; it may be an endless stream.
>>>>>>>>>> 
>>>>>>>>>> Any ideas how to accomplish this?
>>>>>>>>>> Should I try to modify CsvTranslatableTable?
>>>>>>>>>> Or should I take Cassandra adapter as example?
>>>>>>>>>> 
>>>>>>>>>> Initially the data will be CSV, but later Avro is also a good
>>>>>>>>>> candidate.
>>>>>>>>>> 
>>>>>>>>>> Thanks
>>>>>>>>>> Toivo
>>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>> 
>> 
>> 
