How to set default maximum size of bloom filter?

2018-06-25 Thread 俊杰陈
Hi devs

I'm now implementing the bloom filter feature and need to set a default maximum
value for the bloom filter size of a block. According to the calculation here,
I plan to set the maximum size to 1/8 of parquet.block.size, which can achieve
about 0.25 FPP in the case of only one column of long type in a block where all
values are different. What do you think about this? Any feedback is welcome.
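
To make the trade-off concrete, here is a small sketch of the classic bloom
filter false-positive formula p = (1 - e^(-k*n/m))^k. The block size, value
count, and hash counts below are illustrative assumptions, not the exact
calculation referenced above; Parquet's split-block filter layout will give
somewhat different numbers:

    import static java.lang.Math.exp;
    import static java.lang.Math.pow;

    public class BloomFpp {
      // Classic bloom filter FPP: m bits, n distinct values, k hash functions.
      static double fpp(long m, long n, int k) {
        return pow(1.0 - exp(-(double) k * n / m), k);
      }

      public static void main(String[] args) {
        long blockBytes = 128L << 20;            // assumed parquet.block.size: 128 MiB
        long filterBits = (blockBytes / 8) * 8;  // 1/8 of the block, in bits
        long distinct = blockBytes / 8;          // one long column, all values distinct
        for (int k = 1; k <= 8; k++) {
          System.out.printf("k=%d  fpp=%.4f%n", k, fpp(filterBits, distinct, k));
        }
      }
    }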

-- 
Thanks & Best Regards


[jira] [Commented] (PARQUET-1300) [C++] Parquet modular encryption

2018-06-25 Thread Tham (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16523113#comment-16523113
 ] 

Tham commented on PARQUET-1300:
---

[~mdeepak] [~gershinsky] I've just discovered this ticket. I'm also working on 
it (for my project) and am almost done. May I know your progress, and is there 
any way to collaborate so we can have this feature sooner?

> [C++] Parquet modular encryption
> 
>
> Key: PARQUET-1300
> URL: https://issues.apache.org/jira/browse/PARQUET-1300
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Gidon Gershinsky
>Assignee: Deepak Majeti
>Priority: Major
> Attachments: column_reader.cc, column_writer.cc, file_reader.cc, 
> file_writer.cc
>
>
> C++ version of a mechanism for modular encryption and decryption of Parquet 
> files. It allows the data to be kept fully encrypted in storage, while enabling 
> a client to extract a required subset (footer, column(s), pages) and to 
> authenticate / decrypt the extracted data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1334) [C++] memory_map parameter seems misleading in parquet file opener

2018-06-25 Thread Philipp Hoch (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philipp Hoch updated PARQUET-1334:
--
Affects Version/s: (was: 1.9.0)
   cpp-1.4.0

> [C++] memory_map parameter seems misleading in parquet file opener
> ---
>
> Key: PARQUET-1334
> URL: https://issues.apache.org/jira/browse/PARQUET-1334
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Affects Versions: cpp-1.4.0
>Reporter: Philipp Hoch
>Priority: Major
>
> If the memory_map parameter is true, a normal file operation is executed, 
> while in the negative case, the corresponding memory-mapped file operation 
> happens. The logic seems to be either inverted or a bug.
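
A generic illustration of the suspected inversion, written as hypothetical Java rather than the actual parquet-cpp code; the point is that a true memory_map flag should select the memory-mapped branch, not the plain one:

{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the reported behavior; parquet-cpp's reader differs.
class FileOpener {
  static InputStream open(Path path, boolean memoryMap) throws IOException {
    return memoryMap
        ? openMemoryMapped(path)        // expected mapping of the flag
        : Files.newInputStream(path);   // plain read path
  }

  static InputStream openMemoryMapped(Path path) throws IOException {
    // stand-in for a real memory-mapped implementation
    return Files.newInputStream(path);
  }
}
{code}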



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1333) [C++] Reading of files with dictionary size 0 fails on Windows with bad_alloc

2018-06-25 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn updated PARQUET-1333:
-
Fix Version/s: cpp-1.5.0

> [C++] Reading of files with dictionary size 0 fails on Windows with bad_alloc
> -
>
> Key: PARQUET-1333
> URL: https://issues.apache.org/jira/browse/PARQUET-1333
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Affects Versions: cpp-1.4.0
> Environment: Microsoft Windows 10 Pro with latest arrow master.
>Reporter: Philipp Hoch
>Priority: Major
> Fix For: cpp-1.5.0
>
>
> Account for total_size being 0, i.e. having no dictionary entries to allocate. 
> The call with size 0 ends up in arrow's memory pool, 
> [https://github.com/apache/arrow/blob/884474ca5ca1b8da55c0b23eb7cb784c2cd9bdb4/cpp/src/arrow/memory_pool.cc#L50],
>  and the corresponding allocation fails. See the documentation, 
> [https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/aligned-malloc].
>  This only happens on Windows, as posix_memalign seems to handle 0-byte inputs 
> in unix environments.
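
The shape of the fix is the usual zero-size guard. A hedged Java sketch of the pattern only; the real change lives in C++ around arrow's memory pool:

{code:java}
class DictionaryAlloc {
  // Some aligned allocators (e.g. _aligned_malloc on Windows) fail for
  // size 0, so skip the allocation entirely when there are no entries.
  static byte[] allocateDictionaryBuffer(int totalSize) {
    return totalSize == 0 ? new byte[0] : new byte[totalSize];
  }
}
{code}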



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1338) PrimitiveType.equals throws NPE

2018-06-25 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16522204#comment-16522204
 ] 

ASF GitHub Bot commented on PARQUET-1338:
-

wangyum opened a new pull request #498: PARQUET-1338: Fix PrimitiveType.equals 
throwing an NPE
URL: https://github.com/apache/parquet-mr/pull/498
 
 
   Fix `PrimitiveType.equals` throwing an NPE.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> PrimitiveType.equals throws NPE
> --
>
> Key: PARQUET-1338
> URL: https://issues.apache.org/jira/browse/PARQUET-1338
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.10.1
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>  Labels: pull-request-available
>
> Error message:
> {noformat}
> java.lang.NullPointerException
>   at 
> org.apache.parquet.schema.PrimitiveType.equals(PrimitiveType.java:614)
> {noformat}
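
The usual shape of such a fix is a null-safe field comparison. A minimal sketch, assuming the NPE comes from calling {{.equals}} on a nullable field; the field name below is hypothetical and the actual fix in the PR may differ:

{code:java}
import java.util.Objects;

class TypeLike {
  private final Object logicalTypeAnnotation; // may be null

  TypeLike(Object annotation) {
    this.logicalTypeAnnotation = annotation;
  }

  @Override
  public boolean equals(Object obj) {
    if (this == obj) return true;
    if (!(obj instanceof TypeLike)) return false;
    TypeLike other = (TypeLike) obj;
    // Objects.equals tolerates null on either side, avoiding the NPE that
    // a bare this.logicalTypeAnnotation.equals(...) would throw.
    return Objects.equals(logicalTypeAnnotation, other.logicalTypeAnnotation);
  }

  @Override
  public int hashCode() {
    return Objects.hashCode(logicalTypeAnnotation);
  }
}
{code}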



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1338) PrimitiveType.equals throws NPE

2018-06-25 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated PARQUET-1338:

Labels: pull-request-available  (was: )

> PrimitiveType.equals throws NPE
> --
>
> Key: PARQUET-1338
> URL: https://issues.apache.org/jira/browse/PARQUET-1338
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.10.1
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>  Labels: pull-request-available
>
> Error message:
> {noformat}
> java.lang.NullPointerException
>   at 
> org.apache.parquet.schema.PrimitiveType.equals(PrimitiveType.java:614)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1338) PrimitiveType.equals throws NPE

2018-06-25 Thread Yuming Wang (JIRA)
Yuming Wang created PARQUET-1338:


 Summary: PrimitiveType.equals throws NPE
 Key: PARQUET-1338
 URL: https://issues.apache.org/jira/browse/PARQUET-1338
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.10.1
Reporter: Yuming Wang


Error message:
{noformat}
java.lang.NullPointerException
at 
org.apache.parquet.schema.PrimitiveType.equals(PrimitiveType.java:614)
{noformat}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PARQUET-1338) PrimitiveType.equals throws NPE

2018-06-25 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned PARQUET-1338:


Assignee: Yuming Wang

> PrimitiveType.equals throws NPE
> --
>
> Key: PARQUET-1338
> URL: https://issues.apache.org/jira/browse/PARQUET-1338
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.10.1
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> Error message:
> {noformat}
> java.lang.NullPointerException
>   at 
> org.apache.parquet.schema.PrimitiveType.equals(PrimitiveType.java:614)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1336) PrimitiveComparator should implement Serializable

2018-06-25 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated PARQUET-1336:
-
Summary: PrimitiveComparator should implement Serializable  (was: 
BinaryComparator should implement Serializable)

> PrimitiveComparator should implement Serializable
> ---
>
> Key: PARQUET-1336
> URL: https://issues.apache.org/jira/browse/PARQUET-1336
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.10.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>  Labels: pull-request-available
>
> {code:java}
> [info] Cause: java.lang.RuntimeException: java.io.NotSerializableException: 
> org.apache.parquet.schema.PrimitiveComparator$8
> [info] at 
> org.apache.parquet.hadoop.ParquetInputFormat.setFilterPredicate(ParquetInputFormat.java:211)
> [info] at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:399)
> [info] at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:349)
> [info] at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:128)
> [info] at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:182)
> [info] at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
> [info] at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> [info] at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> [info] at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> [info] at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> [info] at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> [info] at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1791)
> [info] at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1162)
> [info] at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1162)
> [info] at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2071)
> [info] at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2071)
> [info] at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> [info] at org.apache.spark.scheduler.Task.run(Task.scala:109)
> [info] at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:367)
> {code}
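
A minimal sketch of the proposed change, assuming the class shape (the real parquet-mr signatures may differ): making the abstract base class {{Serializable}} lets its anonymous subclasses, like the {{PrimitiveComparator$8}} in the stack trace, be serialized when a {{FilterPredicate}} is shipped to Spark executors.

{code:java}
import java.io.Serializable;
import java.util.Comparator;

// Hypothetical base class; anonymous subclasses such as
// PrimitiveComparator$8 inherit Serializable automatically.
abstract class PrimitiveComparatorSketch<T> implements Comparator<T>, Serializable {
}
{code}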



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Estimated row-group size is significantly higher than the written one

2018-06-25 Thread Gabor Szadovszky
Thanks a lot, Ryan. Created the JIRA PARQUET-1337 to track it.

On Sat, Jun 23, 2018 at 1:29 AM Ryan Blue  wrote:

> I think you're right about the cause. The current estimate is what is
> buffered in memory, so it includes all of the intermediate data for the
> last page before it is finalized and compressed.
>
> We could probably get a better estimate by using the amount of buffered
> data and how large other pages in a column were after fully encoding and
> compressing. So if you have 5 pages compressed and buffered, and another
> 1000 values, use the compression ratio of the 5 pages to estimate the final
> size. We'd probably want to use some overhead value for the header. And,
> we'd want to separate the amount of buffered data from our row group size
> estimate, which are currently the same thing.
>
> rb
>
> On Thu, Jun 21, 2018 at 1:17 AM Gabor Szadovszky  wrote:
>
> > Hi All,
> >
> > One of our customers faced the following issue. parquet.block.size is
> > configured to 128M. (parquet.writer.max-padding is left at the default
> > 8M.) On average, 7 row-groups are generated in one block with the sizes
> > ~74M, ~16M, ~12M, ~9M, ~7M, ~5M, ~4M. By increasing the padding to e.g.
> > 60M, only one row-group per block is written, but it is a waste of disk
> > space. Investigating the logs, it turns out that parquet-mr thinks the
> > row-group is already close to 128M, so it writes the first one, then
> > realizes we still have space to write until reaching the block size,
> > and so on:
> > INFO hadoop.InternalParquetRecordWriter: mem size 134,673,545 >
> > 134,217,728: flushing 484,972 records to disk.
> > INFO hadoop.InternalParquetRecordWriter: mem size 59,814,120 >
> > 59,814,925: flushing 99,030 records to disk.
> > INFO hadoop.InternalParquetRecordWriter: mem size 43,396,192 >
> > 43,397,248: flushing 71,848 records to disk.
> > ...
> >
> > My idea about the root cause is that there are many dictionary encoded
> > columns where the value variance is low. When we are approximating the
> > row-group size, there are pages which are still open (not encoded yet). If
> > these pages are dictionary encoded, we count 4-byte values for the
> > dictionary indexes. But if the variance is low, RLE and bit-packing will
> > decrease the size of these pages dramatically.
> >
> > What do you guys think? Are we able to make the approximation a bit
> > better? Do we have some properties that can solve this issue?
> >
> > Thanks a lot,
> > Gabor
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
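
A quick sketch of the effect Gabor describes, under a simplified model (the
numbers are assumptions, not parquet-mr's actual accounting): an open page is
estimated at 4 bytes per dictionary index, while the finalized page bit-packs
each index down to ceil(log2(dictionary size)) bits, or less with RLE:

    public class DictEstimate {
      public static void main(String[] args) {
        long values = 1_000_000;                 // values in the open pages
        int dictSize = 16;                       // low variance: tiny dictionary
        int bitWidth = 32 - Integer.numberOfLeadingZeros(dictSize - 1); // ceil(log2)
        long estimated = values * 4L;            // 4-byte index assumption
        long packed = (values * bitWidth + 7) / 8; // bit-packed size (no RLE)
        System.out.printf("estimated=%d B, bit-packed=%d B (%.0fx overshoot)%n",
            estimated, packed, (double) estimated / packed);
      }
    }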


[jira] [Created] (PARQUET-1337) Implement better estimate of page size for RLE+bitpacking

2018-06-25 Thread Gabor Szadovszky (JIRA)
Gabor Szadovszky created PARQUET-1337:
-

 Summary: Implement better estimate of page size for RLE+bitpacking
 Key: PARQUET-1337
 URL: https://issues.apache.org/jira/browse/PARQUET-1337
 Project: Parquet
  Issue Type: Improvement
Reporter: Gabor Szadovszky


If there are many columns with RLE+bit-packing encoding (e.g. dictionary 
encoding) where the value variance is low, the estimate of the size of the open 
pages (which are not encoded yet) is much larger than the final page size. 
Because of that, parquet-mr fails to create row-groups whose size is close to 
{{parquet.block.size}}, which causes performance issues while reading.

A hint from Ryan to solve this issue:
{quote}
We could probably get a better estimate by using the amount of buffered
data and how large other pages in a column were after fully encoding and
compressing. So if you have 5 pages compressed and buffered, and another
1000 values, use the compression ratio of the 5 pages to estimate the final
size. We'd probably want to use some overhead value for the header. And,
we'd want to separate the amount of buffered data from our row group size
estimate, which are currently the same thing.
{quote}

(So it is not only about RLE+bit-packing, but about any kind of encoding that is 
done only after "closing" a page.)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1312) Improve logical types documentation

2018-06-25 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated PARQUET-1312:

Labels: pull-request-available  (was: )

> Improve logical types documentation
> ---
>
> Key: PARQUET-1312
> URL: https://issues.apache.org/jira/browse/PARQUET-1312
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Major
>  Labels: pull-request-available
>
> Logical types 
> [documentation|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md]
>  should be updated with the new type parameters introduced with the new 
> logical types API (see details in PARQUET-1253 and PARQUET-906)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1312) Improve logical types documentation

2018-06-25 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16521893#comment-16521893
 ] 

ASF GitHub Bot commented on PARQUET-1312:
-

gszadovszky closed pull request #98: PARQUET-1312: Improve logical types 
documentation
URL: https://github.com/apache/parquet-format/pull/98
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/LogicalTypes.md b/LogicalTypes.md
index 762769e7..3be6f211 100644
--- a/LogicalTypes.md
+++ b/LogicalTypes.md
@@ -29,17 +29,41 @@ This file contains the specification for all logical types.
 
 ### Metadata
 
-The parquet format's `ConvertedType` stores the type annotation. The annotation
+The parquet format's `LogicalType` stores the type annotation. The annotation
 may require additional metadata fields, as well as rules for those fields.
+There is an older representation of the logical type annotations called `ConvertedType`.
+To support backward compatibility with old files, readers should interpret `LogicalTypes`
+in the same way as `ConvertedType`, and writers should populate `ConvertedType` in the metadata
+according to well-defined conversion rules.
+
+### Compatibility
+
+The Thrift definition of the metadata has two fields for logical types: `ConvertedType` and `LogicalType`.
+`ConvertedType` is an enum of all available annotations. Since Thrift enums can't have additional type parameters,
+it is cumbersome to define extra type parameters, like decimal scale and precision
+(which are additional 32-bit integer fields on SchemaElement, and are relevant only for decimals) or the time unit
+and UTC adjustment flag for timestamp types. To overcome this problem, a new logical type representation was introduced into
+the metadata to replace `ConvertedType`: `LogicalType`. The new representation is a union of structs of logical types,
+allowing a more flexible API in which logical types can have type parameters.
+
+However, to maintain compatibility, Parquet readers should be able to read
+and interpret the old logical type representation (in case the new one is not present,
+because the file was written by an older writer), and write the `ConvertedType` field for old readers.
+
+Compatibility considerations are mentioned for each annotation in the corresponding section.
 
 ## String Types
 
-### UTF8
+### STRING
 
-`UTF8` may only be used to annotate the binary primitive type and indicates
+`STRING` may only be used to annotate the binary primitive type and indicates
 that the byte array should be interpreted as a UTF-8 encoded character string.
 
-The sort order used for `UTF8` strings is unsigned byte-wise comparison.
+The sort order used for `STRING` strings is unsigned byte-wise comparison.
+
+*Compatibility*
+
+`STRING` corresponds to `UTF8` ConvertedType.
 
 ### ENUM
 
@@ -65,17 +89,21 @@ The sort order used for `UUID` values is unsigned byte-wise comparison.
 
 ### Signed Integers
 
-`INT_8`, `INT_16`, `INT_32`, and `INT_64` annotations can be used to specify
-the maximum number of bits in the stored value.  Implementations may use these
-annotations to produce smaller in-memory representations when reading data.
+The `INT` annotation can be used to specify the maximum number of bits in the stored value.
+The annotation has two parameters: bit width and sign.
+Allowed bit width values are `8`, `16`, `32`, and `64`, and the sign can be `true` or `false`.
+For signed integers, the second parameter should be `true`;
+for example, a signed integer with a bit width of 8 is defined as `INT(8, true)`.
+Implementations may use these annotations to produce smaller
+in-memory representations when reading data.
 
 If a stored value is larger than the maximum allowed by the annotation, the
 behavior is not defined and can be determined by the implementation.
 Implementations must not write values that are larger than the annotation
 allows.
 
-`INT_8`, `INT_16`, and `INT_32` must annotate an `int32` primitive type and
-`INT_64` must annotate an `int64` primitive type. `INT_32` and `INT_64` are
+`INT(8, true)`, `INT(16, true)`, and `INT(32, true)` must annotate an `int32` primitive type and
+`INT(64, true)` must annotate an `int64` primitive type. `INT(32, true)` and `INT(64, true)` are
 implied by the `int32` and `int64` primitive types if no other annotation is
 present and should be considered optional.
 
@@ -83,9 +111,13 @@ The sort order used for signed integer types is signed.
 
 ### Unsigned Integers
 
-`UINT_8`, `UINT_16`, `UINT_32`, and `UINT_64` annotations can be used to
-specify unsigned integer types, along with a maximum number of bits in the
-stored value. Implementations may use these annotations to produce smaller
+`INT` annotation can be 

[jira] [Commented] (PARQUET-1335) Logical type names in parquet-mr are not consistent with parquet-format

2018-06-25 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16521892#comment-16521892
 ] 

ASF GitHub Bot commented on PARQUET-1335:
-

gszadovszky closed pull request #496: PARQUET-1335: Logical type names in 
parquet-mr are not consistent with parquet-format
URL: https://github.com/apache/parquet-mr/pull/496
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git 
a/parquet-column/src/main/java/org/apache/parquet/schema/LogicalTypeAnnotation.java
 
b/parquet-column/src/main/java/org/apache/parquet/schema/LogicalTypeAnnotation.java
index e22867aec..84305939f 100644
--- 
a/parquet-column/src/main/java/org/apache/parquet/schema/LogicalTypeAnnotation.java
+++ 
b/parquet-column/src/main/java/org/apache/parquet/schema/LogicalTypeAnnotation.java
@@ -37,7 +37,7 @@ protected LogicalTypeAnnotation fromString(List<String> params) {
 return listType();
   }
 },
-UTF8 {
+STRING {
   @Override
   protected LogicalTypeAnnotation fromString(List<String> params) {
 return stringType();
@@ -88,7 +88,7 @@ protected LogicalTypeAnnotation fromString(List<String> params) {
 return timestampType(Boolean.parseBoolean(params.get(1)), 
TimeUnit.valueOf(params.get(0)));
   }
 },
-INT {
+INTEGER {
   @Override
   protected LogicalTypeAnnotation fromString(List<String> params) {
 if (params.size() != 2) {
@@ -273,7 +273,7 @@ public void accept(LogicalTypeAnnotationVisitor logicalTypeAnnotationVisitor) {
 
 @Override
 LogicalTypeToken getType() {
-  return LogicalTypeToken.UTF8;
+  return LogicalTypeToken.STRING;
 }
 
 @Override
@@ -646,7 +646,7 @@ public void accept(LogicalTypeAnnotationVisitor logicalTypeAnnotationVisitor) {
 
 @Override
 LogicalTypeToken getType() {
-  return LogicalTypeToken.INT;
+  return LogicalTypeToken.INTEGER;
 }
 
 @Override
diff --git 
a/parquet-column/src/test/java/org/apache/parquet/parser/TestParquetParser.java 
b/parquet-column/src/test/java/org/apache/parquet/parser/TestParquetParser.java
index 5082501af..1abd56a26 100644
--- 
a/parquet-column/src/test/java/org/apache/parquet/parser/TestParquetParser.java
+++ 
b/parquet-column/src/test/java/org/apache/parquet/parser/TestParquetParser.java
@@ -47,7 +47,7 @@
 
 public class TestParquetParser {
   @Test
-  public void testPaperExample() throws Exception {
+  public void testPaperExample() {
 String example =
 "message Document {\n" +
 "  required int64 DocId;\n" +
@@ -122,7 +122,7 @@ public void testEachPrimitiveType() {
   public void testUTF8Annotation() {
 String message =
 "message StringMessage {\n" +
-"  required binary string (UTF8);\n" +
+"  required binary string (STRING);\n" +
 "}\n";
 
 MessageType parsed = parseMessageType(message);
@@ -139,7 +139,7 @@ public void testUTF8Annotation() {
   public void testIDs() {
 String message =
 "message Message {\n" +
-"  required binary string (UTF8) = 6;\n" +
+"  required binary string (STRING) = 6;\n" +
 "  required int32 i=1;\n" +
 "  required binary s2= 3;\n" +
 "  required binary s3 =4;\n" +
@@ -165,7 +165,7 @@ public void testMAPAnnotations() {
 "message Message {\n" +
 "  optional group aMap (MAP) {\n" +
 "repeated group map (MAP_KEY_VALUE) {\n" +
-"  required binary key (UTF8);\n" +
+"  required binary key (STRING);\n" +
 "  required int32 value;\n" +
 "}\n" +
 "  }\n" +
@@ -192,7 +192,7 @@ public void testLISTAnnotation() {
 String message =
 "message Message {\n" +
 "  required group aList (LIST) {\n" +
-"repeated binary string (UTF8);\n" +
+"repeated binary string (STRING);\n" +
 "  }\n" +
 "}\n";
 
@@ -304,14 +304,14 @@ public void testIntAnnotations() {
   @Test
   public void testIntegerAnnotations() {
 String message = "message IntMessage {" +
-  "  required int32 i8 (INT(8,true));" +
-  "  required int32 i16 (INT(16,true));" +
-  "  required int32 i32 (INT(32,true));" +
-  "  required int64 i64 (INT(64,true));" +
-  "  required int32 u8 (INT(8,false));" +
-  "  required int32 u16 (INT(16,false));" +
-  "  required int32 u32 (INT(32,false));" +
-  "  required int64 u64 (INT(64,false));" +
+  "  required int32 i8 (INTEGER(8,true));" +
+  "  required int32 i16 (INTEGER(16,true));" +
+  "  required int32 i32 (INTEGER(32,true));" +
+  "  required int64 i64 (INTEGER(64,true));" +
+  "  required int32 u8 (INTEGER(8,false));" +
+  "  required int32 u16