[ 
https://issues.apache.org/jira/browse/LIVY-667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930215#comment-16930215
 ] 

runzhiwang edited comment on LIVY-667 at 9/18/19 8:16 AM:
----------------------------------------------------------

There are several design to support query a lot of data.

1.Merge the result rdd to one partition, and save as a single file in hdfs. And 
livy reads the file line by line directly.  Cons: it's slow to read line by 
line.
2.Repartition each partition into fixed size, and save in hdfs. And livy reads 
by toLocalIterator which read one partition into memory at one time. Cons: 
there are a lot of files in hdfs if the size of each partition is too small.
3.Cache rdd, and read each partition by batch. Cons: the shortage of memory and 
disk will cause the recompute of rdd, which maybe time-consuming
4.Save rdd to hdfs without repartition. and read each partition by batch. Cons: 
a little complicated to implement.


was (Author: runzhiwang):
I'm working on it

> Support query a lot of data.
> ----------------------------
>
>                 Key: LIVY-667
>                 URL: https://issues.apache.org/jira/browse/LIVY-667
>             Project: Livy
>          Issue Type: Bug
>          Components: Thriftserver
>    Affects Versions: 0.6.0
>            Reporter: runzhiwang
>            Priority: Major
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When enable livy.server.thrift.incrementalCollect, thrift use toLocalIterator 
> to load one partition at each time instead of the whole rdd to avoid 
> OutOfMemory. However, if the largest partition is too big, the OutOfMemory 
> still occurs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to