That is one possibility. However, I'd have to look at the LO API to see whether it
actually gets past the memory-allocation limitation, and then we'd have to discuss
the design of the implementation and whether it would be implemented in both GPDB
and HAWQ - which would be a requirement for MADlib.
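For concreteness, a rough sketch of what spilling a large state through the client-side LO API might look like (the function name and chunk size are hypothetical; assumes an open connection inside an explicit transaction, with error handling elided):

```c
/* Sketch only: streaming a large byte buffer into a PostgreSQL Large
 * Object via libpq. Requires a live server connection `conn` inside a
 * transaction; error checking is omitted for brevity. */
#include <libpq-fe.h>
#include <libpq/libpq-fs.h>   /* INV_READ / INV_WRITE */

Oid
spill_state_to_lo(PGconn *conn, const char *state, size_t len)
{
    Oid lobj = lo_creat(conn, INV_READ | INV_WRITE);
    int fd   = lo_open(conn, lobj, INV_WRITE);

    /* Write in fixed-size chunks, so no single allocation ever has to
     * hold the whole value -- this is how LOs sidestep the 1 GB
     * varlena limit on ordinary column values. */
    const size_t chunk = 8 * 1024 * 1024;   /* hypothetical chunk size */
    size_t off = 0;
    while (off < len) {
        size_t n = (len - off < chunk) ? len - off : chunk;
        lo_write(conn, fd, state + off, n);
        off += n;
    }
    lo_close(conn, fd);
    return lobj;   /* the Oid becomes the handle to the spilled state */
}
```

Whether this chunked access pattern is usable from a server-side aggregate transition function, rather than only from a client, is exactly the design question that would need answering.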

Sent from my iPhone

> On Dec 23, 2015, at 1:39 PM, Ivan Novick <[email protected]> wrote:
> 
> Hi Roman,
> 
> There are requests for larger intermediate data in MADlib.
> 
> Here is an extract from a request:
> 
> """
> Currently 1GB is the max field size for any data in a column in a row. We
> want to increase this in GPDB to 100GB. This will also be used by data science
> to address the issue below, and also to store in a column something larger,
> like an XML or JSON doc that exceeds 1GB.
> 
> As a developer, I want to maintain a larger internal aggregate state in
> memory > 1 GB, so that I can operate on larger data sets.
> 
> Notes
> 1) Many MADlib algorithms need to maintain large internal aggregates. One
> example is the LDA algorithm, which is limited to (number of topics x
> vocabulary size) < ~250M due to the 1 GB limit. For text analytics, this is
> quite restrictive.
> References
> [1] http://www.postgresql.org/docs/9.4/static/sql-createaggregate.html
> """
> 
> On Wed, Dec 23, 2015 at 1:17 PM, Roman Shaposhnik <[email protected]>
> wrote:
> 
>> Atri,
>> 
>> I'm curious what usage do you see for LOs when
>> it comes to MADlib?
>> 
>> Thanks,
>> Roman.
>> 
>>> On Tue, Dec 22, 2015 at 7:53 AM, Atri Sharma <[email protected]> wrote:
>>> Hi All,
>>> 
>>> We are currently working on making Greenplum Large Objects better and
>>> more broadly useful.
>>> 
>>> We were wondering whether MADlib could benefit from Large Objects, for
>>> example by using them to hold intermediate aggregate states that are
>>> too large for ordinary column values.
>>> 
>>> The Large Objects API is documented at
>>> http://www.postgresql.org/docs/9.2/static/largeobjects.html
>>> 
>>> Large Objects will eventually scale out in Greenplum: they will be
>>> distributed across the cluster, and queries over them will be performant.
>>> 
>>> Regards,
>>> 
>>> Atri
>> 
