date:20120514

[jira] [Created] (AVRO-1088) Avro-C - Add performance tests for schema resolution and arrays.

2012-05-14 Thread Vivek Nadkarni (JIRA)

Vivek Nadkarni created AVRO-1088:


 Summary: Avro-C - Add performance tests for schema resolution and 
arrays.
 Key: AVRO-1088
 URL: https://issues.apache.org/jira/browse/AVRO-1088
 Project: Avro
  Issue Type: Improvement
  Components: c
Affects Versions: 1.7.0
 Environment: Ubuntu Linux 11.10
Reporter: Vivek Nadkarni
 Fix For: 1.7.0


The current performance test in Avro-C measures the performance of
reading and writing Avro values using a complex record schema
that does not contain any arrays.

We add tests to measure the performance for simple and nested
arrays. We also replicate all tests to measure the performance of the
schema resolution using a resolved reader and a resolved writer.

Specifically we add the following performance tests:

Nested Record
1. Replicating the nested record (value by index) test using a helper
   function. Using helper functions adds a little overhead, but it
   makes it much easier to test various schemas, as well as different
   modes of schema resolution.
2. Using a resolved writer to resolve between (identical) reader and
   writer schemas, while reading a complex record.
3. Using a resolved reader to resolve between (identical) reader and
   writer schemas, while writing a complex record.

Simple Array
4. Test the performance for reading and writing a simple array.
5. Using a resolved writer to resolve between (identical) reader and
   writer schemas, while reading a simple array.
6. Using a resolved reader to resolve between (identical) reader and
   writer schemas, while writing a simple array.

Nested Array
7. Test the performance for reading and writing a nested array.
8. Using a resolved writer to resolve between (identical) reader and
   writer schemas, while reading a nested array.
9. Using a resolved reader to resolve between (identical) reader and
   writer schemas, while writing a nested array.

Additionally we fix a minor bug:
1. The return value of avro_value_equal_fast() was not being
   tested. Test this return value, and fail if it is FALSE.



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (AVRO-1088) Avro-C - Add performance tests for schema resolution and arrays.

2012-05-14 Thread Vivek Nadkarni (JIRA)

[
https://issues.apache.org/jira/browse/AVRO-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Vivek Nadkarni updated AVRO-1088:
-

Attachment: AVRO-1088.patch

Uploading patch file implementing the new performance tests.

Avro-C - Add performance tests for schema resolution and arrays.

Key: AVRO-1088
URL: https://issues.apache.org/jira/browse/AVRO-1088
Project: Avro
Issue Type: Improvement
Components: c
Affects Versions: 1.7.0
Environment: Ubuntu Linux 11.10
Reporter: Vivek Nadkarni
Fix For: 1.7.0

Attachments: AVRO-1088.patch

Original Estimate: 24h
Remaining Estimate: 24h


[jira] [Updated] (AVRO-1088) Avro-C - Add performance tests for schema resolution and arrays.

2012-05-14 Thread Vivek Nadkarni (JIRA)

[
https://issues.apache.org/jira/browse/AVRO-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Vivek Nadkarni updated AVRO-1088:
-

Status: Patch Available (was: Open)

I ran the performance tests and got the results appended below.

The results show that, as expected, there is a slight performance hit
for using a resolved writer or resolved reader for the complex record,
compared to using the matched schemas.

However, the results also show that for the simple array and for the
nested array, the penalty for using the resolved writer is
substantial. Using the resolved writer takes 30 to 50 times longer
than using no schema resolution or using the resolved reader for
simple and nested arrays.

The performance results indicate that there is likely a bug in the
resolved writer when it resolves simple or nested arrays. This bug
will be reported in a separate Avro JIRA issue.

Running refcount
100000000 tests per run
Run 1
Run 2
Run 3
Average time: 2.423s
Tests/sec:41265475
Running nested record (legacy)
100000 tests per run
Run 1
Run 2
Run 3
Average time: 2.270s
Tests/sec:44053
Running nested record (value by index)
1000000 tests per run
Run 1
Run 2
Run 3
Average time: 2.077s
Tests/sec:481541
Running nested record (value by name)
1000000 tests per run
Run 1
Run 2
Run 3
Average time: 2.333s
Tests/sec:428571
Running nested record (value by index) matched schemas
1000000 tests per run
Run 1
Run 2
Run 3
Average time: 2.147s
Tests/sec:465839
Running nested record (value by index) resolved writer
1000000 tests per run
Run 1
Run 2
Run 3
Average time: 2.480s
Tests/sec:403226
Running nested record (value by index) resolved reader
1000000 tests per run
Run 1
Run 2
Run 3
Average time: 2.230s
Tests/sec:448430
Running simple array matched schemas
250000 tests per run
Run 1
Run 2
Run 3
Average time: 2.123s
Tests/sec:117739
Running simple array resolved writer
10000 tests per run
Run 1
Run 2
Run 3
Average time: 2.747s
Tests/sec:3641
Running simple array resolved reader
250000 tests per run
Run 1
Run 2
Run 3
Average time: 2.270s
Tests/sec:110132
Running nested array matched schemas
250000 tests per run
Run 1
Run 2
Run 3
Average time: 3.030s
Tests/sec:82508
Running nested array resolved writer
10000 tests per run
Run 1
Run 2
Run 3
Average time: 6.650s
Tests/sec:1504
Running simple array resolved reader
250000 tests per run
Run 1
Run 2
Run 3
Average time: 3.313s
Tests/sec:75453
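
For reference, the slowdown factors can be read straight off the Tests/sec
figures above. The short Python sketch below is only an illustration and is
not part of the Avro-C test suite: it copies the reported throughputs into a
dictionary and divides the matched-schema figure by each resolved figure. The
second "simple array resolved reader" entry in the output is treated here as
the nested array resolved reader result, which is an assumption based on its
position in the log.

```python
# Tests/sec figures copied from the benchmark output above.
# Assumption: the second "simple array resolved reader" entry in the log
# is the nested array resolved reader result (judging by its position).
TESTS_PER_SEC = {
    "nested record (value by index)": {
        "matched schemas": 465839,
        "resolved writer": 403226,
        "resolved reader": 448430,
    },
    "simple array": {
        "matched schemas": 117739,
        "resolved writer": 3641,
        "resolved reader": 110132,
    },
    "nested array": {
        "matched schemas": 82508,
        "resolved writer": 1504,
        "resolved reader": 75453,
    },
}


def slowdown(group: str, mode: str) -> float:
    """How many times slower `mode` is than the matched-schema baseline."""
    results = TESTS_PER_SEC[group]
    return results["matched schemas"] / results[mode]


if __name__ == "__main__":
    for group in TESTS_PER_SEC:
        for mode in ("resolved writer", "resolved reader"):
            print(f"{group:32s} {mode}: {slowdown(group, mode):5.1f}x slower")
```

On these numbers the resolved writer comes out roughly 32x slower for the
simple array and roughly 55x slower for the nested array, while every other
combination stays within about 1.2x of the baseline.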

Avro-C - Add performance tests for schema resolution and arrays.

Attachments: AVRO-1088.patch

Original Estimate: 24h
Remaining Estimate: 24h

[jira] [Created] (AVRO-1089) Avro-C - Penalty 30x to 50x for using resolved writer on arrays

2012-05-14 Thread Vivek Nadkarni (JIRA)

Vivek Nadkarni created AVRO-1089:


 Summary: Avro-C - Penalty 30x to 50x for using resolved writer on 
arrays
 Key: AVRO-1089
 URL: https://issues.apache.org/jira/browse/AVRO-1089
 Project: Avro
  Issue Type: Bug
  Components: c
Affects Versions: 1.6.3, 1.7.0
 Environment: Ubuntu Linux
Reporter: Vivek Nadkarni
 Fix For: 1.7.0


The new performance tests created in AVRO-1088 show that using the
resolved writer takes 30 to 50 times longer than using no schema
resolution or using the resolved reader for simple and nested arrays.

For a simple array, using the resolved writer took ~30x longer than
using the memory reader that assumed a matching schema. For the nested
array, using the resolved writer took ~50x longer.

These results suggest that there is a bug in the resolved writer. I do
not have a proposed fix at this time.


 Running simple array matched schemas 
  250000 tests per run
  Run 1
  Run 2
  Run 3
  Average time: 2.123s
  Tests/sec:117739
 Running simple array resolved writer 
  10000 tests per run
  Run 1
  Run 2
  Run 3
  Average time: 2.747s
  Tests/sec:3641


 Running nested array matched schemas 
  250000 tests per run
  Run 1
  Run 2
  Run 3
  Average time: 3.030s
  Tests/sec:82508
 Running nested array resolved writer 
  10000 tests per run
  Run 1
  Run 2
  Run 3
  Average time: 6.650s
  Tests/sec:1504




[jira] [Updated] (AVRO-1089) Avro-C - Penalty 30x to 50x for using resolved writer on arrays

2012-05-14 Thread Vivek Nadkarni (JIRA)


 [ 
https://issues.apache.org/jira/browse/AVRO-1089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vivek Nadkarni updated AVRO-1089:
-

Attachment: AVRO-1089-performance.png

This screenshot was generated using kcachegrind, after running the
performance test test_simple_array_resolved_writer(). The plot shows
that the majority of the time (97%) is spent in the function
avro_resolved_writer_free_elements() called by
avro_resolved_array_writer_reset(). This information suggests that the
bug lies in one of these two functions. Unfortunately, I still don't
have a mechanism or a fix for this issue. 



 Avro-C - Penalty 30x to 50x for using resolved writer on arrays
 ---

 Key: AVRO-1089
 URL: https://issues.apache.org/jira/browse/AVRO-1089
 Project: Avro
  Issue Type: Bug
  Components: c
Affects Versions: 1.6.3, 1.7.0
 Environment: Ubuntu Linux
Reporter: Vivek Nadkarni
 Fix For: 1.7.0

 Attachments: AVRO-1089-performance.png

   Original Estimate: 48h
  Remaining Estimate: 48h


[jira] [Created] (AVRO-1090) DataFileWriter should expose sync marker to allow concurrent writes to same .avro file

2012-05-14 Thread Catalin Alexandru Zamfir (JIRA)

Catalin Alexandru Zamfir created AVRO-1090:
--

 Summary: DataFileWriter should expose sync marker to allow 
concurrent writes to same .avro file
 Key: AVRO-1090
 URL: https://issues.apache.org/jira/browse/AVRO-1090
 Project: Avro
  Issue Type: Bug
Affects Versions: 1.6.3
Reporter: Catalin Alexandru Zamfir


We're writing to Hadoop via DataFileWriter (over an FSDataOutputStream), using 
two threads per node on 8 nodes. Some of the nodes share the same path. For 
example, our TimestampedWriter class takes a path argument and appends the 
timestamp to it (e.g. SomePath/2012/05/14). Thus, two threads or two nodes can 
access the same path. The race condition when these streams are written is 
resolved with a check to see whether the file already exists (has been created 
by a faster thread). If so, the writer appends instead of creating the file on 
HDFS.

The problem is that DataFileWriter generates a random 16-byte sync marker for 
each instance. So two threads with two different writer instances have 
different sync markers. That means that reading the data back fails with an 
IOException (Invalid sync!).
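
A toy model may make the failure easier to see. The Python sketch below is
not Avro's real encoding and uses none of its API: ToyWriter, read_all and the
4-byte length prefix are hypothetical names and choices made for illustration.
It keeps only the two facts that matter here, namely that each writer has its
own 16-byte marker and that the reader compares the marker after every block
with the one stored at the start of the file.

```python
import os

SYNC_SIZE = 16  # the report says each writer generates a 16-byte marker


class ToyWriter:
    """Hypothetical stand-in for a data file writer (not Avro's encoding)."""

    def __init__(self, sync=None):
        # A fresh writer invents its own random marker unless one is supplied.
        self.sync = os.urandom(SYNC_SIZE) if sync is None else sync

    def header(self) -> bytes:
        # The toy header is just the marker itself.
        return self.sync

    def block(self, payload: bytes) -> bytes:
        # 4-byte length prefix, payload, then this writer's marker.
        return len(payload).to_bytes(4, "big") + payload + self.sync


def read_all(data: bytes) -> list:
    """Read every block, checking each trailing marker against the header."""
    sync = data[:SYNC_SIZE]
    pos = SYNC_SIZE
    payloads = []
    while pos < len(data):
        size = int.from_bytes(data[pos:pos + 4], "big")
        pos += 4
        payloads.append(data[pos:pos + size])
        pos += size
        if data[pos:pos + SYNC_SIZE] != sync:
            raise IOError("Invalid sync!")
        pos += SYNC_SIZE
    return payloads


if __name__ == "__main__":
    first = ToyWriter()
    data = first.header() + first.block(b"row-1")

    # A second writer with its own marker appends to the same bytes.
    broken = data + ToyWriter().block(b"row-2")
    try:
        read_all(broken)
    except IOError as err:
        print("mismatched markers:", err)

    # A second writer that reuses the marker found in the header.
    shared = data + ToyWriter(sync=data[:SYNC_SIZE]).block(b"row-2")
    print("shared marker:", read_all(shared))
```

In this model the file stays readable only when the appending writer is
handed the marker that is already in the file.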

There's a big performance penalty here. Because only one writer can write to 
a given path at a time, it becomes a bottleneck. For 1B (billion) rows, it took 
us 4 hours to generate load. With 20 concurrent threads, it took only 12.5 
minutes. 

If DataFileWriter exposed the sync marker, a developer could read it and make 
sure that the next thread that appends to the file uses the same sync marker. 
I don't know whether it's even possible to expose the sync marker so that 
other instances of DataFileWriter can share the marker from the file. We have 
a workaround for this: making sure each writer is a unique instance and 
generating a path based on that uniqueness. But instead of having 
SomePath/2012/05/14/Shard.avro we'd now have 
SomePath/2012/05/14/Shard-some-random-UUID.avro for each of the writers that 
write the data in.
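
The per-writer path workaround can be sketched in a few lines. This is a
Python illustration only (the reporter's code is Java and is not shown in
this issue); shard_path is a hypothetical helper name, and the directory
layout is taken from the example paths above.

```python
import uuid
from pathlib import PurePosixPath


def shard_path(base: str, day: str) -> str:
    """Build a per-writer file name so no two writers share a path."""
    # uuid4() is random, so each call yields a different shard name.
    return str(PurePosixPath(base) / day / f"Shard-{uuid.uuid4()}.avro")


if __name__ == "__main__":
    print(shard_path("SomePath", "2012/05/14"))
    print(shard_path("SomePath", "2012/05/14"))
```

Each writer then owns its file and its sync marker, at the cost of many
small shard files per day instead of one.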

If it can be done, it would be a huge fix for a bottleneck problem. The 
bottleneck being the single writer that can write to a single path.


[jira] [Updated] (AVRO-1081) GenericDatumWriter does not support native ByteBuffers

2012-05-14 Thread Robert Fuller (JIRA)


 [ 
https://issues.apache.org/jira/browse/AVRO-1081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Fuller updated AVRO-1081:


Attachment: patch.diff.txt
ByteBufferTest.java

Attached one possible way of fixing this.

I am not 100% happy with the solution, but it should work for our case for now. 
We are writing several avro files concurrently from within a heavily 
multithreaded application, and cannot afford to load many of the files into 
memory at once at that point.

When reading the files again (in hadoop m/r job) it's ok for now to read the 
bytes back to memory (will read them into a byte array, but for the purpose of 
the test wanted to put them into a byte buffer).

Am also now wondering whether you might consider direct support for File (or 
something like FileData) in avro. This would allow a way to include into avro 
items which exceed the amount of available memory. But maybe it's not a 
requirement anybody else would have...

 GenericDatumWriter does not support native ByteBuffers
 --

 Key: AVRO-1081
 URL: https://issues.apache.org/jira/browse/AVRO-1081
 Project: Avro
  Issue Type: Bug
Affects Versions: 1.6.3
Reporter: Robert Fuller
 Attachments: ByteBufferTest.java, ByteBufferTest.java, 
 patch.diff.txt, patch.diff.txt


 An exception is thrown when trying to encode bytes backed by a file:
 java.lang.UnsupportedOperationException: null
   at java.nio.ByteBuffer.arrayOffset(ByteBuffer.java:968) ~[na:1.6.0_31]
   at org.apache.avro.io.BinaryEncoder.writeBytes(BinaryEncoder.java:61) ~[avro-1.6.3.jar:1.6.3]
 Note that arrayOffset is an optional method, see:
 http://docs.oracle.com/javase/6/docs/api/java/nio/ByteBuffer.html#arrayOffset%28%29
 FileChannel returns a native ByteBuffer, not a HeapByteBuffer.
 See here:
 http://mail-archives.apache.org/mod_mbox/avro-user/201202.mbox/%3ccb57f421.6bfe2%25sc...@richrelevance.com%3E


[jira] [Commented] (AVRO-1090) DataFileWriter should expose sync marker to allow concurrent writes to same .avro file

2012-05-14 Thread Catalin Alexandru Zamfir (JIRA)

[
https://issues.apache.org/jira/browse/AVRO-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13274584#comment-13274584
]

Catalin Alexandru Zamfir commented on AVRO-1090:

From the source code of DataFileWriter, the appendTo method seems to do what
is needed, but it only accepts a File argument. When a writer connects to
Hadoop over the network and needs to write to an FSDataOutputStream instead of
a file, appendTo cannot be used. Still, it shows that it is possible to
retrieve the sync marker from an existing .avro file and keep writing with the
same marker.

Can this be done here also? Can an appendTo (FSDataOutputStream) method be
created? This would allow concurrent writers to create or append on the same
output stream using the same marker, thus enabling the data to be read back.

DataFileWriter should expose sync marker to allow concurrent writes to same
.avro file

Key: AVRO-1090
URL: https://issues.apache.org/jira/browse/AVRO-1090
Project: Avro
Issue Type: Bug
Affects Versions: 1.6.3
Reporter: Catalin Alexandru Zamfir


[jira] [Commented] (AVRO-1090) DataFileWriter should expose sync marker to allow concurrent writes to same .avro file

2012-05-14 Thread Catalin Alexandru Zamfir (JIRA)

[
https://issues.apache.org/jira/browse/AVRO-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13274592#comment-13274592
]

Catalin Alexandru Zamfir commented on AVRO-1090:

Also, we're doing objRecordWriter.create (getHdfs ().append (objPath)), which
should make a DataFileWriter on the FSDataOutputStream that respects the
first sync marker of the written file. So if thread #1 has already created the
file, thread #2 can append to the given path. Because it appends, it does not
need to call generateSync for a new sync marker. Instead, it can read the
sync marker from the already generated file and use it as its own sync marker.

This does not happen. We get an Invalid sync because we create multiple
writers in different threads: even if one thread finishes first and creates
the given path, the next thread that should append to it does not take into
account that it should first read the existing sync marker stored in the file.
DataFileWriter should take into account that a file/path/stream written by it
already contains a sync marker, and that there's no need to generate another
one. It should get the existing sync marker and use that to append the data to
the given HDFS path.

Thanks for your patience.

DataFileWriter should expose sync marker to allow concurrent writes to same
.avro file

Key: AVRO-1090
URL: https://issues.apache.org/jira/browse/AVRO-1090
Project: Avro
Issue Type: Bug
Affects Versions: 1.6.3
Reporter: Catalin Alexandru Zamfir

[jira] [Updated] (AVRO-1079) C++ Generator, improve include guard generation

2012-05-14 Thread Thiruvalluvan M. G. (JIRA)

[
https://issues.apache.org/jira/browse/AVRO-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Thiruvalluvan M. G. updated AVRO-1079:
--

Assignee: Thiruvalluvan M. G.
Status: Patch Available (was: Open)

C++ Generator, improve include guard generation
---

Key: AVRO-1079
URL: https://issues.apache.org/jira/browse/AVRO-1079
Project: Avro
Issue Type: Improvement
Components: c++
Affects Versions: 1.6.3, 1.7.0
Reporter: falk.trist...@cae.de
Assignee: Thiruvalluvan M. G.
Priority: Minor
Attachments: AVRO-1079.patch

Original Estimate: 5h
Remaining Estimate: 5h

We integrated Avro into our CMake build system. So we have a JSON-to-C++
header build step, which is quite similar to the Qt moc compiler. This build
step overwrites the target header files if the JSON schema file was modified
or when the generated header files are different.
In addition to that we want to put the generated header files under version
control.
However, the generated include guards contain a random part. This random
part troubles the version control system. It also leads to unnecessary
rebuilds, as CMake 'thinks' important files have been changed and triggers
a rebuild of the corresponding dependencies.
Suggestion:
Add an additional command line parameter to either switch off the random part
of the string, or to take a string from the command line and use this as
include guard.

[jira] [Updated] (AVRO-1079) C++ Generator, improve include guard generation

2012-05-14 Thread Thiruvalluvan M. G. (JIRA)

[
https://issues.apache.org/jira/browse/AVRO-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Thiruvalluvan M. G. updated AVRO-1079:
--

Attachment: AVRO-1079.patch

This patch tries to reuse the include guard if the output file already exists.
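
The idea of reusing an existing guard can be sketched as follows. This is a
Python illustration of the approach described above, not the code in the
patch (the generator itself is C++): pick_guard and fresh_guard are
hypothetical names, and the guard format is invented for the example.

```python
import random
import re

# Matches the first "#ifndef NAME" line of a header.
GUARD_RE = re.compile(r"^[ \t]*#ifndef[ \t]+(\w+)[ \t]*$", re.MULTILINE)


def fresh_guard(stem: str) -> str:
    """Make a guard with a random part (format invented for this sketch)."""
    return f"{stem.upper()}_{random.randrange(2**32)}_H_"


def pick_guard(stem: str, existing_header=None) -> str:
    """Reuse the guard of an existing header, else generate a new one."""
    if existing_header is not None:
        match = GUARD_RE.search(existing_header)
        if match:
            return match.group(1)
    return fresh_guard(stem)


if __name__ == "__main__":
    old = "#ifndef CPX_123_H_\n#define CPX_123_H_\n#endif\n"
    print(pick_guard("cpx", old))  # guard taken from the existing header
    print(pick_guard("cpx"))       # no existing file: new random guard
```

With a scheme like this, regenerating a header from an unchanged schema
yields identical text, so neither version control nor the build system sees
a change.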

C++ Generator, improve include guard generation
---

Original Estimate: 5h
Remaining Estimate: 5h

[jira] [Updated] (AVRO-1090) DataFileWriter should expose sync marker to allow concurrent writes to same .avro file

2012-05-14 Thread Catalin Alexandru Zamfir (JIRA)

[
https://issues.apache.org/jira/browse/AVRO-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Catalin Alexandru Zamfir updated AVRO-1090:
---

Description:
We're writing to Hadoop via DataFileWriter (FSDataOutputStream). We're doing
this with two threads per node, on 8 nodes. Some of the nodes share the same
path. For example, our: TimestampedWriter class, takes a path argument and
appends the timestamp to it (ex: SomePath/2012/05/14). Thus, two threads or two
nodes can access the same path. The race condition when these streams are
written, is resolved with a check to see if the file exists (has been created)
by a faster thread. If that's so, it appends, instead of creating the file on
the HDFS.

The problem is that DataFileWriter, generates a 16-byte, random string for each
instance. So, two threads with 2 different writer instances, have a different
sync marker. That means that data, when trying to read it back, will get an
IOException (Invalid sync!).

There's a big performance penalty here. Because only one writer can write at
once to one given path, it becomes a bottleneck. For 1B (billion) rows, it took
us 4 hours to generate load. With 20 concurrent threads, it took only 12.5
minutes.

If DataFileWriter would expose the sync marker, a developer could read that
and make sure that the next thread that appends to the file, uses the same sync
marker. Don't know if it's even possible to expose the sync marker so as other
instances of DataFileWriter can share the sync marker, from the file. We have
a fix for this, making sure each writer is a unique instance and generating
a path based on that uniqueness. But instead of having
SomePath/2012/05/14/Shard.avro we'd now have
SomePath/2012/05/14/Shard-some-random-UUID.avro for each of the writers that
write the data in.

If it can be done, it would be a huge fix for a bottleneck problem. The
bottleneck being the single writer that can write to a single path.

THIS HAS ALSO been requested on the avro-user thread:
http://grokbase.com/t/avro/user/122m4sjm1y/is-it-possible-to-append-to-an-already-existing-avro-file
I just could not find the JIRA ticket for this request.

was:
We're writing to Hadoop via DataFileWriter (FSDataOutputStream). We're doing
this with two threads per node, on 8 nodes. Some of the nodes share the same
path. For example, our: TimestampedWriter class, takes a path argument and
appends the timestamp to it (ex: SomePath/2012/05/14). Thus, two threads or two
nodes can access the same path. The race condition when these streams are
written, is resolved with a check to see if the file exists (has been created)
by a faster thread. If that's so, it appends, instead of creating the file on
the HDFS.

If it can be done, it would be a huge fix for a bottleneck problem. The
bottleneck being the single writer that can write to a single path.

DataFileWriter should expose sync marker to allow concurrent writes to same
.avro file

Key: AVRO-1090
URL: https://issues.apache.org/jira/browse/AVRO-1090
Project: Avro
Issue Type: Bug
Affects Versions: 1.6.3
Reporter: Catalin Alexandru Zamfir

[jira] [Commented] (AVRO-1085) Fingerprinting for C#

2012-05-14 Thread Thiruvalluvan M. G. (JIRA)


[ 
https://issues.apache.org/jira/browse/AVRO-1085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13274743#comment-13274743
 ] 

Thiruvalluvan M. G. commented on AVRO-1085:
---

+1. Looks good to me.

Built and tested on Windows 7, Visual C# Express 2010.

 Fingerprinting for C#
 -

 Key: AVRO-1085
 URL: https://issues.apache.org/jira/browse/AVRO-1085
 Project: Avro
  Issue Type: New Feature
  Components: csharp
Affects Versions: 1.7.0
Reporter: Eric Hauser
 Fix For: 1.7.0

 Attachments: AVRO-1085.patch


 Avro fingerprinting for C#.  This is a direct port of the Java 
 SchemaNormalization class and tests.

