[ https://issues.apache.org/jira/browse/MADLIB-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank McQuillan updated MADLIB-1019:
------------------------------------
    Description: 
Entered on behalf of a user doing text analytics work...

"We're testing some MADlib functions (we're running this install of MADlib: 
madlib-ossv1.9_pv1.9.5_gpdb4.3orca-rhel5-x86_64). While testing, we are 
running into performance issues as we try to scale up our data set. 

We took a subset of our data and ran on a varying number of rows, each row 
between 900 and 1000 bytes long.  The list below shows the number of rows in 
our base dataset (which feeds into SVEC) and the time each run took:

{code}
1,000 rows -> 1/2 sec
10,000 rows -> 20 sec
100,000 rows -> at 15 min we killed the process
{code}
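
As a rough way to quantify the timings above, the sketch below fits a power law (time proportional to rows**k) to the two completed runs. Only two data points are available, so the exponent is an indication rather than a measurement; the function name `scaling_exponent` is hypothetical, and the killed 100,000-row run is treated only as a lower bound.

```python
import math

# Timings reported above as (rows, seconds).
timings = [(1_000, 0.5), (10_000, 20.0)]


def scaling_exponent(small, large):
    """Return k such that time ~ rows**k passes through both measurements."""
    (n1, t1), (n2, t2) = small, large
    return math.log(t2 / t1) / math.log(n2 / n1)


# k == 1.0 would mean linear scaling; larger values mean worse than linear.
k = scaling_exponent(*timings)
print(f"estimated exponent, 1k -> 10k rows: {k:.2f}")

# The 100,000-row run was killed at 15 minutes, so its true runtime is
# at least 900 seconds. That gives only a lower bound on the exponent.
k_lower = scaling_exponent((10_000, 20.0), (100_000, 15 * 60))
print(f"lower bound, 10k -> 100k rows: {k_lower:.2f}")
```

An exponent well above 1 on both intervals is consistent with the report that the run time grows much faster than the row count.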

This is nowhere near linear scaling, and it also shows severe skew: only one 
postgres process on one node is used during this processing.  The query that 
we're running is:

{code}
CREATE TABLE public.tfidf AS (
   SELECT
        doc_id as document_id,
        madlib.svec_mult( sparse_vector, logidf ) tf_idf
    FROM
        public.corpus,
        ( SELECT madlib.svec_log(
            madlib.svec_div(
            count(sparse_vector)::madlib.svec,
            madlib.svec_count_nonzero(sparse_vector)
            )
        ) logidf FROM public.corpus ) foo
    --ORDER BY document_id
) DISTRIBUTED BY (document_id)
{code}
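
For readers unfamiliar with the svec functions, the sketch below is a small dense analogue of what the query above appears to compute, written in plain Python. It assumes that svec_count_nonzero() aggregates a per-term count of documents with a non-zero entry and that the other functions operate element-wise; the toy `corpus` and all variable names are hypothetical, and this is not how MADlib implements the operations.

```python
import math

# Hypothetical toy corpus: one dense term-count vector per document id.
corpus = {
    1: [2, 0, 1],
    2: [0, 3, 1],
    3: [1, 0, 4],
}

n_docs = len(corpus)  # plays the role of count(sparse_vector)
n_terms = len(next(iter(corpus.values())))

# Per-term document frequency, analogous to svec_count_nonzero(sparse_vector).
doc_freq = [
    sum(1 for vec in corpus.values() if vec[j] != 0) for j in range(n_terms)
]

# log(n_docs / doc_freq), analogous to svec_log(svec_div(...)).
# This single vector is computed once for the whole corpus.
log_idf = [math.log(n_docs / df) for df in doc_freq]

# Element-wise product per document, analogous to
# svec_mult(sparse_vector, logidf).
tf_idf = {
    doc_id: [tf * w for tf, w in zip(vec, log_idf)]
    for doc_id, vec in corpus.items()
}

for doc_id, weights in tf_idf.items():
    print(doc_id, [round(w, 3) for w in weights])
```

In this form the per-document work is one multiplication per term against a single shared vector, which is the step the query pairs with every row of public.corpus.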

After some investigation, we determined that madlib.svec_mult() is the 
performance bottleneck here.  The inner SELECT that calls svec_div() and 
svec_count_nonzero() runs relatively quickly.  We're looking for guidance/help 
on (a) the skew issue and (b) performance in general, because ultimately we 
need to scale up to 45M base table rows, and that will greatly increase the 
SVEC size."

  was:
Entered on behalf of a user doing text analytics work...

We're testing with some MADlib functions (we're running this install of MADlib 
madlib-ossv1.9_pv1.9.5_gpdb4.3orca-rhel5-x86_64 )  While testing, we are 
running into some performance issues as we try to scale up our data set. 

We took a subset of our data and ran on a varying number of rows with the rows 
being between 900 and 1000 bytes long.  The following bullets show the number 
of rows in our base dataset (which feeds into SVEC) and the time it took to run:

{code}
1,000 rows -> 1/2 sec
10,000 rows -> 20 sec
100,000 rows -> @15mins we killed the process.
{code}

This is not scaling anywhere near linearly and also it is demonstrating severe 
skew in that only one postgres process on one node is used during this 
processing.  The query that we're running is:

{code}
CREATE TABLE public.tfidf AS (
   SELECT
        doc_id as document_id,
        madlib.svec_mult( sparse_vector, logidf ) tf_idf
    FROM
        public.corpus,
        ( SELECT madlib.svec_log(
            madlib.svec_div(
            count(sparse_vector)::madlib.svec,
            madlib.svec_count_nonzero(sparse_vector)
            )
        ) logidf FROM public.corpus ) foo
    --ORDER BY document_id
) DISTRIBUTED BY (document_id)
{code}

After some investigation, we determined that the madlib.svec_mult() is the 
performance bottleneck here.  The internal select that calls svec_div() and 
svec_count_nonzero() runs relatively quickly.  We're looking for guidance/help 
on (a) the skew issue and (b) the performance in general because ultimately, we 
need to scale up to 45M base table rows and that will greatly increase the SVEC 
size.


> Scalability of SVEC
> -------------------
>
>                 Key: MADLIB-1019
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1019
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Sparse Vectors
>            Reporter: Frank McQuillan



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
