[jira] [Updated] (DRILL-4740) Improvements to "Analyzing the Yelp Academic Dataset"

Paul Rogers (JIRA) Tue, 21 Jun 2016 11:25:42 -0700

     [ https://issues.apache.org/jira/browse/DRILL-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Paul Rogers updated DRILL-4740:
-------------------------------
    Description: 
Consider the topic paragraph for the Yelp sample data page: 
http://drill.apache.org/docs/analyzing-the-yelp-academic-dataset/

It could use a bit of TLC. For example:

"Apache Drill is one of the fastest growing open source projects, with the 
community making rapid progress with monthly releases The key difference is 
Drill’s agility and flexibility."

This is a non sequitur. The speed and agility of the software do not drive 
the monthly releases. Can we reword it to say that Drill’s speed and agility 
make it a popular project? And that many people work hard to make it better 
with monthly releases? Something like that...

(Although, at present, releases have dropped to bi-monthly or quarterly...)

And:

"Along with meeting the table stakes for SQL-on-Hadoop, which is to achieve low 
latency performance at scale, …"

There seem to be two problems.

1. What does “meeting the table stakes” mean? It is unclear.
2. This is a run-on sentence that tries to express several thoughts at once 
and should be rewritten.

Then, there is redundancy:

"...Drill allows users to analyze the data without any ETL or up-front schema 
definitions. … Drill, has a “no schema” approach…"

I’m sure this paragraph was written quickly early on, but it could certainly be 
improved a bit…

More comments:

1. Minor nit: "This document aligns Drill output for example purposes. Drill 
output is not aligned in this case."

I think that what this is saying is, “Drill output in this document is aligned 
for clarity. The actual Drill output you see may not be aligned.”

It would be better to explain why it is not aligned here, since data is aligned 
in the earlier examples…

2.  Somewhat off: "You can directly query self-describing files such as JSON, 
Parquet, and text. There is no need to create metadata definitions in the Hive 
metastore."

I think what this is saying is that Drill infers schema information from 
self-describing files such as JSON, Parquet and CSV/TSV (with a header row). 
Contrast this with other systems, such as Hive, that require that you first 
define the schema in a data dictionary.

Note that text is NOT a self-describing file format in the general case!

3.  Yelp seems to be creating new revisions of their data set. I downloaded 
Round 7. The results differ from those in the Drill page text. Perhaps insert a 
statement that the examples used Round (whatever round) and that the reader’s 
results may differ when using later rounds.

4.  The Yelp data is JSON. Somewhere near the top of the page (perhaps directly 
under “Querying Data with Drill”), we should say:

The Yelp data is in JSON format.

Where “JSON format” would be a link to the JSON docs: 
https://drill.apache.org/docs/json-data-model/

This is handy later when we tell the user to set all_text_mode:

"First, change Drill to work in all text mode (so we can take a look at all of 
the data)."

Where we should add: (See the JSON Data Model documentation for more 
information.)
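
For reference, the all text mode step is typically a single statement along 
these lines. This is a sketch: the option name store.json.all_text_mode is the 
one described in the JSON Data Model documentation, and ALTER SESSION could be 
used instead to limit the change to the current connection.

```sql
-- Read every JSON scalar value as VARCHAR so that mixed-type fields can be inspected.
alter system set `store.json.all_text_mode` = true;
```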

5. This query:

select attributes from 
dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_business.json` limit 10;

Appears all on one line and is truncated at the right of the page. Looks like 
we’ve broken our other long queries onto multiple lines. Perhaps this one needs 
the same treatment.
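
A possible layout, shown only as a suggestion (the line breaks are my own 
choice; the path placeholder is the one used on the page):

```sql
select attributes
from dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_business.json`
limit 10;
```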

7. Here: "Top first categories in number of review counts"

Perhaps copy the following text from the JSON format page to add explanation:

“Query Complex Data” show how to use composite types to access nested arrays.

8. Another nit. Consider “Top businesses with cool rated reviews”. This and 
similar items are headers, but appear as regular text. The items have the HTML 
h4 tag, but no special formatting. Can we make them bold or some such?

9. The following example SQL has two problems:

0: jdbc:drill:zk=local> create or replace view dfs.tmp.businessreviews as 
Select b.name,b.stars,b.state,b.city,r.votes.funny,r.votes.useful,r.votes.cool, 
r.`date` 
from dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_business.json` b, 
dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_review.json` r 
where r.business_id=b.business_id

First, the third line runs off the page in my (moderately sized) window. 
Perhaps split it after “b,”.

Second, the statement must end with a semicolon: “b.business_id;”.
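
With both fixes applied, the statement might look like the following. This is a 
sketch: the sqlline prompt is omitted, and the extra line breaks and spacing 
are suggestions rather than anything the page requires.

```sql
-- Same view as on the page, wrapped to fit the page width and terminated with a semicolon.
create or replace view dfs.tmp.businessreviews as
select b.name, b.stars, b.state, b.city,
       r.votes.funny, r.votes.useful, r.votes.cool,
       r.`date`
from dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_business.json` b,
     dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_review.json` r
where r.business_id = b.business_id;
```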

10. Another nit. This paragraph:

"The goal of Apache Drill is to provide the freedom and flexibility in 
exploring data in ways we have never seen before with SQL technologies. The 
community is working on more exciting features around nested data and 
supporting data with changing schemas in upcoming releases."

Would seem to be a better fit at the top of the page rather than toward the end.

11. Another nit. This paragraph:

"In addition to these queries, you can get many deep insights using Drill’s SQL 
functionality. If you are not comfortable with writing queries manually, you 
can use a BI/Analytics tools such as Tableau/MicroStrategy to query raw 
files/Hive/HBase data or Drill-created views directly using Drill ODBC/JDBC 
drivers."

Seems like a fine summary. However, it is currently awkwardly placed between 
two examples. I’ll guess that the FLATTEN example was added later. Perhaps move 
this paragraph to the end before “Stay tuned…” so that it is returned to its 
role as a summary.

12. Then, reverse the order of the following two items: "The FLATTEN function 
can be used to dynamically rationalize semi-structured data so you can apply 
even deeper SQL functionality. Here is a sample query:"

"Get a flattened list of categories for each business"

Once the “Get a flattened…” text looks like a header, the “The FLATTEN 
function…” text is a nice introductory paragraph.
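
For context, a FLATTEN query of the kind this section introduces would look 
roughly like this. It is a sketch, not copied from the page: it assumes the 
business file has a name field and a categories array, and the limit is 
arbitrary.

```sql
-- Emit one output row per (business, category) pair.
select name, flatten(categories) as categories
from dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_business.json`
limit 20;
```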

13. At the end of the page we have a list of links. The list could use some 
work. The first item (download) shows the link separate from the label; the 
others have the link tied to the label. Can we make this consistent?

Also, we should link to the topics covered on this page such as JSON format 
(see above for link), File system plugin 
(http://drill.apache.org/docs/file-system-storage-plugin/), and so on.


  was:
Consider the topic paragraph for the Yelp sample data page: 
http://drill.apache.org/docs/analyzing-the-yelp-academic-dataset/

It could use a bit of TLC. For example:

"Apache Drill is one of the fastest growing open source projects, with the 
community making rapid progress with monthly releases The key difference is 
Drill’s agility and flexibility."

This is a non sequitur. The speed and agility of the software do not drive 
the monthly releases. Can we reword it to say that Drill’s speed and agility 
make it a popular project? And that many people work hard to make it better 
with monthly releases? Something like that...

(Although, at present, releases have dropped to bi-monthly or quarterly...)

And:

"Along with meeting the table stakes for SQL-on-Hadoop, which is to achieve low 
latency performance at scale, …"

There seem to be two problems.

1. What does “meeting the table stakes” mean? It is unclear.
2. This is a run-on sentence that tries to express several thoughts at once 
and should be rewritten.

Then, there is redundancy:

"...Drill allows users to analyze the data without any ETL or up-front schema 
definitions. … Drill, has a “no schema” approach…"

I’m sure this paragraph was written quickly early on, but it could certainly be 
improved a bit…


> Improvements to "Analyzing the Yelp Academic Dataset"
> -----------------------------------------------------
>
>                 Key: DRILL-4740
>                 URL: https://issues.apache.org/jira/browse/DRILL-4740
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Documentation
>    Affects Versions: 1.6.0
>            Reporter: Paul Rogers
>            Priority: Minor



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
