[GitHub] [incubator-hudi] yihua commented on a change in pull request #925: [HUDI-256] Translate Comparison page

GitBox Wed, 25 Sep 2019 15:28:33 -0700

yihua commented on a change in pull request #925: [HUDI-256] Translate Comparison page
URL: https://github.com/apache/incubator-hudi/pull/925#discussion_r328363112


 ##########
 File path: docs/comparison.cn.md
 ##########
 @@ -6,53 +6,45 @@ permalink: comparison.html
 toc: false
 ---
 
-Apache Hudi fills a big void for processing data on top of DFS, and thus mostly co-exists nicely with these technologies. However,
-it would be useful to understand how Hudi fits into the current big data ecosystem, contrasting it with a few related systems
-and bring out the different tradeoffs these systems have accepted in their design.
+Apache Hudi填补了在DFS上处理数据的巨大空白，并可以和这些技术很好地共存。然而，
+了解Hudi如何适应当前的大数据生态系统，并将其与一些相关系统进行对比，了解这些系统在设计中做的不同权衡将非常有用。
 
 ## Kudu
 
-[Apache Kudu](https://kudu.apache.org) is a storage system that has similar goals as Hudi, which is to bring real-time analytics on petabytes of data via first
-class support for `upserts`. A key differentiator is that Kudu also attempts to serve as a datastore for OLTP workloads, something that Hudi does not aspire to be.
-Consequently, Kudu does not support incremental pulling (as of early 2017), something Hudi does to enable incremental processing use cases.
+[Apache Kudu](https://kudu.apache.org)是一个与Hudi具有相似目标的存储系统，该系统通过对`upserts`支持来对PB级数据进行实时分析。
+一个关键的区别是Kudu还试图充当OLTP工作负载的数据存储，而Hudi并不希望这样做。
+因此，Kudu不支持增量拉取(截至2017年初)，而Hudi支持以便进行增量处理。
 
+Kudu与分布式文件系统抽象和HDFS完全不同，它自己的一组存储服务器通过RAFT相互通信。
+另一方面，Hudi旨在与底层Hadoop兼容文件系统(HDFS，S3或Ceph)一起使用，并且没有自己的存储服务器群，而是依靠Apache Spark来完成繁重的工作。
+因此，Hudi可以像其他Spark作业一样轻松扩展，而Kudu则需要硬件和运营支持，特别是HBase或Vertica等数据存储系统。
+到目前为止，我们还没有针对Kudu做任何正面的基准测试(鉴于RTTable正在进行中)。
+但是，如果我们要使用[CERN](https://db-blog.web.cern.ch/blog/zbigniew-baranowski/2017-01-performance-comparison-different-file-formats-and -存储引擎)，
+我们希望Hudi定位于能吸纳parquet的卓越性能。
 
-Kudu diverges from a distributed file system abstraction and HDFS altogether, with its own set of storage servers talking to each  other via RAFT.
-Hudi, on the other hand, is designed to work with an underlying Hadoop compatible filesystem (HDFS,S3 or Ceph) and does not have its own fleet of storage servers,
-instead relying on Apache Spark to do the heavy-lifting. Thu, Hudi can be scaled easily, just like other Spark jobs, while Kudu would require hardware
-& operational support, typical to datastores like HBase or Vertica. We have not at this point, done any head to head benchmarks against Kudu (given RTTable is WIP).
-But, if we were to go with results shared by [CERN](https://db-blog.web.cern.ch/blog/zbigniew-baranowski/2017-01-performance-comparison-different-file-formats-and-storage-engines) ,
-we expect Hudi to positioned at something that ingests parquet with superior performance.
+## Hive事务
 
+[Hive事务/ACID](https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions)是另一项类似的工作，它试图实现像在ORC文件格式之上的`读取时合并`。
+可以理解，此功能与Hive以及[LLAP](https://cwiki.apache.org/confluence/display/Hive/LLAP)之类的其他工作紧密相关。
+Hive事务不提供Hudi提供的读取优化存储选项或增量拉取。
+在实现选择方面，Hudi充分利用了类似Spark的处理框架的功能，而Hive事务特性则在用户或Hive Metastore启动的Hive任务/查询的下实现。
+根据我们的生产经验，与其他方法相比，将Hudi作为库嵌入到现有的Spark管道中要容易得多，并且操作不会太繁琐。
+Hudi还设计用于与Presto/Spark等非Hive引擎合作，并将随着时间的推移合并除parquet以外的文件格式。
 
-## Hive Transactions
+## HBase
 
-[Hive Transactions/ACID](https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions) is another similar effort, which tries to implement storage like
-`merge-on-read`, on top of ORC file format. Understandably, this feature is heavily tied to Hive and other efforts like [LLAP](https://cwiki.apache.org/confluence/display/Hive/LLAP).
-Hive transactions does not offer the read-optimized storage option or the incremental pulling, that Hudi does. In terms of implementation choices, Hudi leverages
-the full power of a processing framework like Spark, while Hive transactions feature is implemented underneath by Hive tasks/queries kicked off by user or the Hive metastore.
-Based on our production experience, embedding Hudi as a library into existing Spark pipelines was much easier and less operationally heavy, compared with the other approach.
-Hudi is also designed to work with non-hive enginers like Presto/Spark and will incorporate file formats other than parquet over time.
+尽管[HBase](https://hbase.apache.org)最终是OLTP工作负载的键值存储层，但由于与Hadoop的相似性，用户通常倾向于将HBase与分析相关联。
+鉴于HBase经过严格的写优化，它支持开箱即用的亚秒级更新，Hive-on-HBase允许用户查询该数据。 但是，就分析工作负载的实际性能而言，Parquet/ORC之类的混合列式存储格式可以轻松击败HBase，因为这些工作负载主要是读取繁重的工作。
+Hudi弥补了更快的数据与分析存储格式之间的差距。从操作的角度来看，与管理分析使用的HBase region服务器集群相比，为用户提供可提供更快数据的库更具可扩展性。
 
 Review comment:
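   “操作的角度” => “运营的角度” (for "from an operational perspective", use 运营, meaning running or operating a service, rather than 操作, meaning performing an action)
   “为用户提供可提供更快数据的库” => “为用户提供可更快给出数据的库” (avoids using 提供, "provide", twice when rendering "arming users with a library that provides faster data")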

