[ https://issues.apache.org/jira/browse/MADLIB-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank McQuillan updated MADLIB-1019:
------------------------------------
    Priority: Major  (was: Minor)

> Scalability of SVEC
> -------------------
>
>                 Key: MADLIB-1019
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1019
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Sparse Vectors
>            Reporter: Frank McQuillan
>
> Entered on behalf of a user doing text analytics work...
> We're testing some MADlib functions (our install is
> madlib-ossv1.9_pv1.9.5_gpdb4.3orca-rhel5-x86_64). While testing, we are
> running into performance issues as we try to scale up our data set.
> We took a subset of our data and ran on a varying number of rows, with each
> row between 900 and 1000 bytes long. The following list shows the number of
> rows in our base dataset (which feeds into SVEC) and the time it took to
> run:
> {code}
> 1,000 rows -> 1/2 sec
> 10,000 rows -> 20 sec
> 100,000 rows -> killed the process at 15 min
> {code}
> This is nowhere near linear scaling, and it also shows severe skew: only
> one postgres process on one node is used during this processing. The query
> that we're running is:
> {code}
> CREATE TABLE public.tfidf AS (
>     SELECT
>         doc_id AS document_id,
>         madlib.svec_mult(sparse_vector, logidf) AS tf_idf
>     FROM
>         public.corpus,
>         (
>             SELECT madlib.svec_log(
>                 madlib.svec_div(
>                     count(sparse_vector)::madlib.svec,
>                     madlib.svec_count_nonzero(sparse_vector)
>                 )
>             ) AS logidf
>             FROM public.corpus
>         ) foo
>     --ORDER BY document_id
> ) DISTRIBUTED BY (document_id)
> {code}
> After some investigation, we determined that madlib.svec_mult() is the
> performance bottleneck here. The inner SELECT that calls svec_div() and
> svec_count_nonzero() runs relatively quickly. We're looking for
> guidance/help on (a) the skew issue and (b) performance in general,
> because ultimately we need to scale up to 45M base table rows, and that
> will greatly increase the SVEC size.
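
For readers unfamiliar with the query in the description, here is a minimal sketch of the computation it appears to express, written in plain Python with dense lists rather than MADlib sparse vectors. It assumes that count(sparse_vector)::madlib.svec yields the number of documents, that madlib.svec_count_nonzero() counts, per position, how many rows have a non-zero entry, and that svec_div(), svec_log() and svec_mult() work element by element. The helper names log_idf and tf_idf and the sample corpus are hypothetical, and the sketch says nothing about how MADlib implements these functions or why svec_mult() is slow.

```python
import math


def log_idf(term_counts):
    """Per-term log(N / df) for a list of equal-length term-count vectors."""
    n_docs = len(term_counts)
    n_terms = len(term_counts[0])
    # doc_freq[j] = number of documents in which term j occurs at least once
    doc_freq = [
        sum(1 for row in term_counts if row[j] != 0) for j in range(n_terms)
    ]
    # Assumption: every term occurs in at least one document, so df > 0.
    return [math.log(n_docs / df) for df in doc_freq]


def tf_idf(term_counts):
    """Element-wise product of each document's counts with the log-idf vector."""
    weights = log_idf(term_counts)  # computed once for the whole corpus
    return [[tf * w for tf, w in zip(row, weights)] for row in term_counts]


# Hypothetical corpus: 4 documents over a 3-term dictionary.
corpus = [
    [1, 0, 2],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
]
weights = log_idf(corpus)
scores = tf_idf(corpus)
```

In this sketch the log-idf vector is a single value shared by every row, which mirrors the single-row subquery joined against public.corpus in the query above.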



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
