Hi again Yan,
Sorry about the late reply. The *ParquetOutputFormat* class has a
number of setters:
    public static void setBlockSize(Job job, int blockSize) {
        getConfiguration(job).setInt(BLOCK_SIZE, blockSize);
    }

    public static void setPageSize(Job job, int pageSize) {
        getConfiguration(job).setInt(PAGE_SIZE, pageSize);
    }

    public static void setDictionaryPageSize(Job job, int pageSize) {
        getConfiguration(job).setInt(DICTIONARY_PAGE_SIZE, pageSize);
    }

    public static void setCompression(Job job, CompressionCodecName compression) {
        getConfiguration(job).set(COMPRESSION, compression.name());
    }

    public static void setEnableDictionary(Job job, boolean enableDictionary) {
        getConfiguration(job).setBoolean(ENABLE_DICTIONARY, enableDictionary);
    }
These allow you to set the 'row group' size (i.e. block size) and the
page size, which determine how much data is written out per block (and,
transitively, how much data is retained in memory before a flush). Try
setting these to, say, 128 MB for a block and 1 MB for a page (to
test). If this doesn't work, can you let us know what sizes you're
currently using (via the associated getters, also on
ParquetOutputFormat)?
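In case it helps, here is a minimal sketch of the byte values to pass, since the setters take plain int byte counts. The class name ParquetSizes and the helper megabytes are made up for illustration, and the commented-out calls assume you already have a Hadoop Job called job:

```java
public class ParquetSizes {
    private static final int BYTES_PER_MB = 1024 * 1024;

    // Hypothetical helper: converts megabytes to bytes and throws
    // ArithmeticException if the result does not fit in an int.
    static int megabytes(int mb) {
        return Math.multiplyExact(mb, BYTES_PER_MB);
    }

    public static void main(String[] args) {
        int blockSize = megabytes(128); // row group size in bytes
        int pageSize = megabytes(1);    // page size in bytes
        System.out.println("block size = " + blockSize + " bytes");
        System.out.println("page size  = " + pageSize + " bytes");

        // With Hadoop and Parquet on the classpath, and 'job' being your Job:
        // ParquetOutputFormat.setBlockSize(job, blockSize);
        // ParquetOutputFormat.setPageSize(job, pageSize);
    }
}
```

Because the setters take an int, anything of 2 GB or more won't fit; the helper fails loudly in that case instead of silently overflowing.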
Thanks
On Wed, Jan 6, 2016 at 4:23 PM, Yan Qi <[email protected]> wrote:
Hi Reuben,
Thanks for your quick reply! :)
The table has nested columns with the following Avro schema:
{
  "namespace": "profile.avro.parquet.model",
  "type": "record",
  "name": "Profile",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "M1", "type": ["Market", "null"]},
    {"name": "M2", "type": ["Market", "null"]},
    .......
    .......
    {"name": "M100", "type": ["Market", "null"]}
  ]
}
{
  "namespace": "profile.avro.parquet.model",
  "type": "record",
  "name": "Market",
  "fields": [
    {"name": "item1", "type": [{"type": "array", "items": "Client"}, "null"]},
    {"name": "item2", "type": [{"type": "array", "items": "Client"}, "null"]},
    {"name": "item3", "type": [{"type": "array", "items": "Client"}, "null"]},
    {"name": "item4", "type": [{"type": "array", "items": "Client"}, "null"]},
    {"name": "item5", "type": [{"type": "array", "items": "Client"}, "null"]}
  ]
}
{
  "namespace": "profile.avro.parquet.model",
  "type": "record",
  "name": "Client",
  "fields": [
    {"name": "attribute1", "type": "int"},
    {"name": "attribute2", "type": "int"},
    {"name": "attribute3", "type": "int"},
    ......
    ......
    {"name": "attribute50", "type": "int"}
  ]
}
A record in the table may not have every attribute populated. For
example, a Profile record may have values only for M1, M20 and M89,
with the others empty. When we tried to write such a record in the
Parquet format, it required a lot of memory to get started.
We also tried another way to define the table, like:
{
  "namespace": "profile.avro.parquet.model",
  "type": "record",
  "name": "Profile",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "markets", "type": [{"type": "array", "items": "Market"}, "null"]}
  ]
}
Interestingly, this layout can handle the same data with much less
memory. But we lose the columnar storage benefits for the individual
Market members, because we have to load the data for all markets no
matter which market a query is concerned with.
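To put rough numbers on the difference, the sketch below counts leaf columns in the two layouts. It assumes each int attribute of Client becomes one Parquet leaf column and that writer memory grows with the number of leaf columns; both are our assumptions rather than something we have confirmed. The class and method names are made up for illustration.

```java
public class LeafColumns {
    // First layout: id plus M1..M100, each Market having item1..item5,
    // each item an array of Client with attribute1..attribute50.
    static int wideLayout(int markets, int itemsPerMarket, int attrsPerClient) {
        return 1 + markets * itemsPerMarket * attrsPerClient;
    }

    // Second layout: id plus a single 'markets' array of Market.
    static int arrayLayout(int itemsPerMarket, int attrsPerClient) {
        return 1 + itemsPerMarket * attrsPerClient;
    }

    public static void main(String[] args) {
        System.out.println("first layout:  " + wideLayout(100, 5, 50));
        System.out.println("second layout: " + arrayLayout(5, 50));
    }
}
```

With the sizes from the schemas above, this gives 25001 leaf columns for the first layout and 251 for the second.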
I hope this gives you a rough idea of the application. So my question
is whether increasing the memory size is the only option in the former
case, or whether there is a better way to define the table.
Best regards,
Yan
On Wed, Jan 6, 2016 at 12:03 PM, Reuben Kuhnert <[email protected]> wrote:
Hi Yan,
So the primary concern here would be the 'row group' size that you're
using for your table. The row group is basically what determines how
much information is held in memory before being flushed to disk (this
becomes an even greater issue if you have multiple Parquet files open
simultaneously). If you could, can you share some of the stats about
your file with us? We'll see if we can't get you moving again.
Thanks
Reuben
On Wed, Jan 6, 2016 at 1:54 PM, Yan Qi <[email protected]> wrote:
We are trying to create a large table in Parquet. The table has up to
thousands of columns, but its records may not be large because many of
the columns are empty. We are using Avro-Parquet for data
serialization/de-serialization. However, we ran into an out-of-memory
issue when writing the data in the Parquet format.
Our understanding is that Parquet may keep an internal structure for
the table schema, which may take more memory as the table becomes
larger. If that's the case, our question is:
Is there a limit to the table size that Parquet can support? If yes,
how could we determine the limit?
Thanks,
Yan