[ https://issues.apache.org/jira/browse/DRILL-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Paul Rogers updated DRILL-4740:
-------------------------------

Description:
Consider the topic paragraph for the Yelp sample data page:
http://drill.apache.org/docs/analyzing-the-yelp-academic-dataset/

It could use a bit of TLC. For example:

"Apache Drill is one of the fastest growing open source projects, with the community making rapid progress with monthly releases The key difference is Drill’s agility and flexibility."

This is a non sequitur. The speed and agility of the software do not drive the monthly releases. Can we reword it to say that Drill’s speed and agility make it a popular project? And that many people work hard to make it better with monthly releases? Something like that...

(Although, at present, releases have dropped to bi-monthly or quarterly...)

And:

"Along with meeting the table stakes for SQL-on-Hadoop, which is to achieve low latency performance at scale, …"

There seem to be two problems.
1. What does “meeting the table stakes” mean? Very unclear.
2. This is a run-on sentence that tries to express multiple thoughts at once and should be rewritten.

Then, there is redundancy:

"...Drill allows users to analyze the data without any ETL or up-front schema definitions. … Drill, has a “no schema” approach…"

I’m sure this paragraph was written quickly early on, but it could certainly be improved a bit…

More comments:

1. Minor nit: "This document aligns Drill output for example purposes. Drill output is not aligned in this case."
I think that what this is saying is, “Drill output in this document is aligned for clarity. The actual Drill output you see may not be aligned.” It would be better to explain why it is not aligned here, since data is aligned in the earlier examples…

2. Somewhat off: "You can directly query self-describing files such as JSON, Parquet, and text. There is no need to create metadata definitions in the Hive metastore."
I think what this is saying is that Drill infers schema information from self-describing files such as JSON, Parquet and CSV/TSV (with a header row). Contrast this with other systems, such as Hive, that require that you first define the schema in a data dictionary.
Note that text is NOT a self-describing file format in the general case!

3. Yelp seems to be creating new revisions of their data set. I downloaded Round 7. The results differ from those in the Drill page text. Perhaps insert a statement that the examples used Round (whatever round) and that the reader’s results may differ when using later rounds.

4. The Yelp data is JSON. Somewhere near the top of the page (perhaps directly under “Querying Data with Drill”), we should say:
The Yelp data is in JSON format.
Where “JSON format” would be a link to the JSON docs:
https://drill.apache.org/docs/json-data-model/
This is handy later when we tell the user to set the all_text_mode:
First, change Drill to work in all text mode (so we can take a look at all of the data).
Where we should add: (See the JSON Data Model documentation for more information.)

5. This query:
select attributes from dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_business.json` limit 10;
Appears all on one line and is truncated at the right of the page. Looks like we’ve broken our other long queries onto multiple lines. Perhaps this one needs the same treatment.

7. Here: "Top first categories in number of review counts"
Perhaps copy the following text from the JSON format page to add explanation:
“Query Complex Data” show how to use composite types to access nested arrays.

8. Another nit. Consider “Top businesses with cool rated reviews”. This (and similar items) are headers, but appear as regular text. The items have the HTML h4 tag, but have no special formatting. Can we make them bold or some such?

9.
The following example SQL has two problems:

0: jdbc:drill:zk=local> create or replace view dfs.tmp.businessreviews as
Select b.name,b.stars,b.state,b.city,r.votes.funny,r.votes.useful,r.votes.cool, r.`date`
from dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_business.json` b, dfs.`/<path-to-yelp-dataset>/yelp/yelp_academic_dataset_review.json` r
where r.business_id=b.business_id

First, the third line scrolls off the page on my (moderately sized) screen. Perhaps split it after “b, ”.
Second, the statement must end with a semicolon: “b.business_id;”.

10. Another nit. This paragraph:
"The goal of Apache Drill is to provide the freedom and flexibility in exploring data in ways we have never seen before with SQL technologies. The community is working on more exciting features around nested data and supporting data with changing schemas in upcoming releases."
Would seem to be a better fit at the top of the page rather than toward the end.

11. Another nit. This paragraph:
"In addition to these queries, you can get many deep insights using Drill’s SQL functionality. If you are not comfortable with writing queries manually, you can use a BI/Analytics tools such as Tableau/MicroStrategy to query raw files/Hive/HBase data or Drill-created views directly using Drill ODBC/JDBC drivers."
Seems like a fine summary. However, it is currently awkwardly placed between two examples. I’ll guess that the FLATTEN example was added later. Perhaps move this paragraph to the end before “Stay tuned…” so that it is returned to its role as a summary.

12. Then, reverse the order of the following two items:
"The FLATTEN function can be used to dynamically rationalize semi-structured data so you can apply even deeper SQL functionality. Here is a sample query:"
"Get a flattened list of categories for each business"
Once the “Get a flattened…” text looks like a header, the “The FLATTEN function…” text is a nice introductory paragraph.

13. At the end of the page we have a list of links.
The list could use some work. The first item (download) shows the link separate from the label; the others have the link tied to the label. Can we make this consistent?
Also, we should link to the topics covered on this page such as JSON format (see above for link), File system plugin (http://drill.apache.org/docs/file-system-storage-plugin/), and so on.

was:
Consider the topic paragraph for the Yelp sample data page:
http://drill.apache.org/docs/analyzing-the-yelp-academic-dataset/

It could use a bit of TLC. For example:

"Apache Drill is one of the fastest growing open source projects, with the community making rapid progress with monthly releases The key difference is Drill’s agility and flexibility."

This is a non sequitur. The speed and agility of the software do not drive the monthly releases. Can we reword it to say that Drill’s speed and agility make it a popular project? And that many people work hard to make it better with monthly releases? Something like that...

(Although, at present, releases have dropped to bi-monthly or quarterly...)

And:

"Along with meeting the table stakes for SQL-on-Hadoop, which is to achieve low latency performance at scale, …"

There seem to be two problems.
1. What does “meeting the table stakes” mean? Very unclear.
2. This is a run-on sentence that tries to express multiple thoughts at once and should be rewritten.

Then, there is redundancy:

"...Drill allows users to analyze the data without any ETL or up-front schema definitions.
… Drill, has a “no schema” approach…"

I’m sure this paragraph was written quickly early on, but it could certainly be improved a bit…

> Improvements to "Analyzing the Yelp Academic Dataset"
> -----------------------------------------------------
>
> Key: DRILL-4740
> URL: https://issues.apache.org/jira/browse/DRILL-4740
> Project: Apache Drill
> Issue Type: Improvement
> Components: Documentation
> Affects Versions: 1.6.0
> Reporter: Paul Rogers
> Priority: Minor

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)