Frank McQuillan created MADLIB-1019:
---------------------------------------
Summary: Scalability of SVEC
Key: MADLIB-1019
URL: https://issues.apache.org/jira/browse/MADLIB-1019
Project: Apache MADlib
Issue Type: Improvement
Components: Module: Sparse Vectors
Reporter: Frank McQuillan
Entered on behalf of a user doing text analytics work...
We're testing some MADlib functions (our install is
madlib-ossv1.9_pv1.9.5_gpdb4.3orca-rhel5-x86_64). While testing, we are
running into performance issues as we try to scale up our data set.
We took a subset of our data and ran on a varying number of rows, each row
being between 900 and 1000 bytes long. The following list shows the number
of rows in our base dataset (which feeds into SVEC) and the time it took to run:
{code}
1,000 rows -> 1/2 sec
10,000 rows -> 20 sec
100,000 rows -> killed the process at 15 min
{code}
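As a rough check on the timings above, the two completed runs can be turned into an empirical growth exponent. This is only a sketch: it assumes a power-law model (time proportional to rows^k), uses just two data points, and the helper name `projected_seconds` is made up for illustration.

```python
import math

# Timings reported above as (rows, seconds).
# The 100,000-row run was killed, so it is not used for the fit.
timings = [(1_000, 0.5), (10_000, 20.0)]

(n1, t1), (n2, t2) = timings

# Empirical exponent k in the assumed model t ~ n^k, estimated from two points only.
k = math.log(t2 / t1) / math.log(n2 / n1)


def projected_seconds(rows):
    """Extrapolate run time from the 10,000-row measurement using exponent k."""
    return t2 * (rows / n2) ** k


print(f"exponent k = {k:.2f}")
print(f"projected time for 100,000 rows = {projected_seconds(100_000) / 60:.1f} min")
```

Under these assumptions the exponent comes out at about 1.6, and the projection for 100,000 rows is about 13 minutes. The run was still going at 15 minutes, so the real growth is at least that steep. Linear scaling would correspond to k = 1.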
This is not scaling anywhere near linearly: a 10x increase in rows (1,000 to
10,000) took 40x as long. It also shows severe skew, in that only one postgres
process on one node is used during this processing. The query that we're
running is:
{code}
CREATE TABLE public.tfidf AS (
    SELECT
        doc_id AS document_id,
        madlib.svec_mult(sparse_vector, logidf) AS tf_idf
    FROM
        public.corpus,
        (
            SELECT madlib.svec_log(
                       madlib.svec_div(
                           count(sparse_vector)::madlib.svec,
                           madlib.svec_count_nonzero(sparse_vector)
                       )
                   ) AS logidf
            FROM public.corpus
        ) foo
    -- ORDER BY document_id
) DISTRIBUTED BY (document_id);
{code}
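For readers less familiar with the svec functions, the computation the query expresses can be sketched in plain Python. This is a dense-list model, not MADlib's implementation: it assumes svec_count_nonzero() yields per-term document counts and that svec_div(), svec_log() and svec_mult() act element-wise. The toy corpus and the names `log_idf` and `tf_idf` are hypothetical.

```python
import math

# Hypothetical toy corpus: doc_id -> dense term-count vector over a shared dictionary.
corpus = {
    1: [1, 0, 2],
    2: [0, 0, 3],
    3: [4, 1, 1],
}


def log_idf(vectors):
    """Per-term log(N / document frequency), the role of the inner SELECT."""
    n_docs = len(vectors)
    n_terms = len(vectors[0])
    # Number of documents in which each term has a nonzero count.
    doc_freq = [sum(1 for v in vectors if v[j] != 0) for j in range(n_terms)]
    # Assumes every term occurs in at least one document, so doc_freq has no zeros.
    return [math.log(n_docs / df) for df in doc_freq]


def tf_idf(docs):
    """Multiply each document vector element-wise by the shared log-IDF vector."""
    logidf = log_idf(list(docs.values()))  # computed once for the whole corpus
    return {
        doc_id: [tf * w for tf, w in zip(vec, logidf)]
        for doc_id, vec in docs.items()
    }


result = tf_idf(corpus)
for doc_id, weights in result.items():
    print(doc_id, weights)
```

In this dense model the outer step is one element-wise multiply per document against a single shared vector, so the sketch itself is linear in the number of rows. It says nothing about how svec_mult() behaves on compressed sparse vectors.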
After some investigation, we determined that madlib.svec_mult() is the
performance bottleneck here. The inner SELECT that calls svec_div() and
svec_count_nonzero() runs relatively quickly. We're looking for guidance/help
on (a) the skew issue and (b) performance in general, because ultimately we
need to scale up to 45M base table rows, and that will greatly increase the
SVEC size.