[ https://issues.apache.org/jira/browse/RANGER-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16257719#comment-16257719 ]
Ramesh Mani commented on RANGER-1837:
-------------------------------------

[~bosco] Regarding best practices, this article has some details:
https://community.hortonworks.com/articles/75501/orc-creation-best-practices.html
I am also pinging the ORC team.

[~owen.omalley] Pinging you for your input on the best practices we need to adhere to when creating ORC files. In this JIRA we are introducing an option to store Ranger audits in HDFS as ORC files, but we need guidance on the strategy to follow so that the resulting ORC files can be queried efficiently through Hive or any other means. Which ORC batch configs should we expose in Ranger so that users can tune them to their advantage? Could you please provide your thoughts?

> Enhance Ranger Audit to HDFS to support ORC file format
> -------------------------------------------------------
>
>                 Key: RANGER-1837
>                 URL: https://issues.apache.org/jira/browse/RANGER-1837
>             Project: Ranger
>          Issue Type: Improvement
>          Components: audit
>            Reporter: Kevin Risden
>            Assignee: Ramesh Mani
>         Attachments: 
> 0001-RANGER-1837-Enhance-Ranger-Audit-to-HDFS-to-support-.patch, 
> 0001-RANGER-1837-Enhance-Ranger-Audit-to-HDFS-to-support_001.patch
>
>
> My team has done some research and found that Ranger HDFS audits are:
> * Stored as JSON objects (one per line)
> * Not compressed
> This is currently very verbose and would benefit from compression since this 
> data is not frequently accessed. 
> From Bosco on the mailing list:
> {quote}You are right, currently one of the options is saving the audits in 
> HDFS itself as JSON files in one folder per day. I have loaded these JSON 
> files from the folder into Hive as compressed ORC format. The compressed 
> files in ORC were less than 10% of the original size. So, it was significant 
> decrease in size. Also, it is easier to run analytics on the Hive tables.
>  
> So, there are couple of ways of doing it.
>  
> Write an Oozie job which runs every night and loads the previous day worth 
> audit logs into ORC or other format
> Write a AuditDestination which can write into the format you want to.
>  
> Regardless which approach you take, this would be a good feature for 
> Ranger.{quote}
> http://mail-archives.apache.org/mod_mbox/ranger-user/201710.mbox/%3CCAJU9nmiYzzUUX1uDEysLAcMti4iLmX7RE%3DmN2%3DdoLaaQf87njQ%40mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)