Re: [PR] docs(blog): add Hudi PuppyGraph blog [hudi]

via GitHub Thu, 02 Oct 2025 13:00:57 -0700


xushiyan commented on code in PR #14040:
URL: https://github.com/apache/hudi/pull/14040#discussion_r2399910822



##########
website/blog/2025-10-02-Real-Time-Cloud-Security-Graphs-Hudi+PuppyGraph.mdx:
##########
@@ -0,0 +1,241 @@
+---
+title: "Real-Time Cloud Security Graphs with Apache Hudi and PuppyGraph"
+excerpt: ""
+author: Jaz Samantha Ku, in collaboration with Shiyan Xu
+category: blog
+image: /assets/images/blog/2025-10-02-Real-Time-Cloud-Security-Graphs-Hudi+PuppyGraph/fig-4.png
+tags:
+- Apache Hudi
+- PuppyGraph
+- security
+---
+
+[CrowdStrike’s 2025 Global Threat 
Report](https://www.crowdstrike.com/en-us/global-threat-report/) puts average 
eCrime breakout time at 48 minutes, with the fastest at 51 seconds. This means 
that by the time security teams are even alerted about the potential breach, 
attackers have already long infiltrated the system. And that’s assuming they 
even get alerted. Cloud environments generate massive amounts of access logs, 
configuration changes, alerts, and telemetry. Reviewing these events in 
isolation rarely surfaces patterns like lateral movement or privilege 
escalation.
+
+Security tools such as SIEM, CSPM, and cloud workload protection need 
relationship-based analysis. It is not only a login attempt or a policy change, 
but also who acted, which systems were touched, what privileges were active, 
and what happened next. Event-centric methods struggle to answer those 
questions at scale. Graph analysis fits better because it captures paths and 
context across entities.
+
+To keep up, the data pipeline must support:
+
+* Continuous upserts with low lag so detections run on the latest state  
+* Incremental consumption so analytics read only “what changed since T”  
+* A rewindable timeline so responders can review state during investigations
+
+With Apache Hudi and PuppyGraph, this becomes straightforward. Hudi tables 
support fast upserts and incremental processing. PuppyGraph queries 
relationships in place using openCypher or Gremlin. In this blog, we explore 
how to get started with real-time security graph analytics at scale using the 
data already stored in your Hudi lakehouse tables.
+
+## Why Apache Hudi for Cybersecurity Data?
+
+[Apache Hudi](https://hudi.apache.org/) is an open data lakehouse platform 
that brings ACID transaction guarantees to data lakes. It enables efficient, 
record-level updates and deletes on massive datasets, which makes it a strong 
foundation for storing and analyzing cybersecurity data such as logs, 
telemetry, and threat intelligence. Its combination of performance, 
flexibility, and broad ecosystem integration is well-suited for threat 
detection, forensic investigation, and compliance work.
+
+Hudi speeds up large-scale security analytics through features that keep 
tables both current and query-efficient. Hudi writers excel at handling 
continuous, mutable workloads without requiring costly full rewrites. Hudi’s 
[multi-modal indexing subsystem](https://hudi.apache.org/docs/metadata), backed 
by its internal metadata table, offers efficient lookups and data skipping, 
dramatically accelerating queries that scan massive log sets to isolate 
suspicious activity. Hudi keeps tables updatable and queryable as they change, 
with time-travel and incremental reads for point-in-time forensic analysis.
+
+Even as data volumes grow, operations remain manageable. Hudi tracks every 
commit on a timeline, enabling powerful time-travel queries for historical 
investigations. Asynchronous table services like 
[compaction](https://hudi.apache.org/docs/compaction), 
[clustering](https://hudi.apache.org/docs/clustering), 
[cleaning](https://hudi.apache.org/docs/cleaning), and 
[indexing](https://hudi.apache.org/docs/metadata_indexing) run in the 
background to maintain peak performance and storage health while minimizing 
disruption to ingestion pipelines. Furthermore, its consistent commit and 
delete semantics support the creation of reliable audit trails, simplify data 
retention policies, and help meet privacy requirements.
+
+<figure>
+![](/assets/images/hudi-stack-1-x.png)
+<figcaption>The [Apache Hudi 
Stack](https://hudi.apache.org/docs/hudi_stack)</figcaption>
+</figure>
+
+
+Hudi also integrates seamlessly with the tools security teams already use. You 
can stream data from Apache Kafka or Debezium CDC into Hudi, register tables in 
Hive Metastore or AWS Glue Catalog, and query them from popular query engines 
like Apache Spark, Apache Flink, Presto, Trino, or Amazon Athena. PuppyGraph 
connects to the same Hudi tables and runs openCypher or Gremlin queries 
directly on them via the user access layer, so you get real-time graph 
analytics on the lake with no ETL and no data duplication.
+
+## Why PuppyGraph for Cybersecurity Data?
+
+[PuppyGraph](https://puppygraph.com/) is the first real-time, zero-ETL graph 
query engine. It lets data teams query existing relational stores as a single 
graph and get up and running in under 10 minutes, avoiding the cost, latency, 
and maintenance of a separate graph database. To understand why this is so 
important, let’s take a look at the status quo.
+
+### Traditional Analytics on the Lake
+
+Security teams already store logs, configs, and alerts in a lakehouse. SQL 
engines are great for counts, filters, rollups, and point lookups. They 
struggle when questions depend on relationships. Lateral movement, privilege 
escalation, and blast radius span many tables and time windows. Each new join 
adds complexity, pushes latency up, and breaks easily when schemas evolve or 
events arrive late. You can stitch context with views and pipelines, but it is 
fragile and slow to adapt.
+
+### Dedicated Graph Databases
+
+Graphs make paths and neighborhoods first class. Graph queries let you answer 
“what connects to what” in a way that makes sense, without the need for 
confusing data joins. The tradeoff is operations and freshness. Most graph 
databases want their own storage. That means ETL, a second copy, and lag 
between source and graph. Continuous upserts are heavy because every change can 
touch nodes, edges, and multiple indexes. Running a separate cluster adds 
backups, upgrades, sizing, and vendor-specific tuning. During an incident, that 
overhead shows up as stale data and slower investigations.
+
+### How PuppyGraph Helps
+
+PuppyGraph is not a traditional graph database but a graph query engine 
designed to run directly on top of your existing data infrastructure without 
costly and complex ETL (Extract, Transform, Load) processes. This "zero-ETL" 
approach is its core differentiator, allowing you to query relational data in 
data warehouses, data lakes, and databases as a unified graph model in minutes.
+
+Instead of migrating data into a specialized store, PuppyGraph connects to 
sources including 
[PostgreSQL](https://www.puppygraph.com/blog/postgresql-graph-database), 
[Apache 
Iceberg](https://docs.puppygraph.com/connecting/connecting-to-iceberg/?h=ice), 
[Apache 
Hudi](https://docs.puppygraph.com/connecting/connecting-to-apache-hudi/), 
[BigQuery](https://docs.puppygraph.com/connecting/connecting-to-bigquery/?h=bigq),
 and others, then builds a virtual graph layer over them. Graph models are 
defined through simple JSON schema files, making it easy to update, version, or 
switch graph views without touching the underlying data. From there, you can 
quickly begin exploring your data with graph queries written in Gremlin or 
openCypher.
+
+<figure>
+![][image2]

Review Comment:
   TODO



##########
website/blog/2025-10-02-Real-Time-Cloud-Security-Graphs-Hudi+PuppyGraph.mdx:
##########
@@ -0,0 +1,241 @@
+<figure>
+![][image2]
+<figcaption>PuppyGraph Supported Data Sources</figcaption>
+</figure>
+
+<figure>
+![][image3]
+<figcaption>Architecture with Graph Database vs. with PuppyGraph</figcaption>
+</figure>
+
+This approach aligns with the broader shift in modern data stacks to separate 
compute from storage. You keep data where it belongs and scale query power 
independently, which supports petabyte-level workloads without duplicating data 
or managing fragile pipelines.
+
+## Real-World Use Case
+
+We have shown why cloud security benefits from a relationship-first view of 
identities, resources, and events. In this demo, we’ll show how easy it is to 
begin querying your cloud security data as a graph. Apache Hudi keeps those 
tables current with streaming upserts and an investigation-friendly timeline. 
PuppyGraph lets you query your existing lake tables as a graph. Together they 
give you real-time security graph analytics on the data you already store.
+
+Getting started is straightforward. You will deploy the stack, load security 
data into Hudi, connect PuppyGraph to your catalog, define a graph view, and 
run a few queries. All in a matter of minutes.
+
+<figure>
+![][image4]

Review Comment:
   same for below



##########
website/blog/2025-10-02-Real-Time-Cloud-Security-Graphs-Hudi+PuppyGraph.mdx:
##########
@@ -0,0 +1,241 @@
+<figure>
+![][image4]

Review Comment:
   TODO



##########
website/blog/2025-10-02-Real-Time-Cloud-Security-Graphs-Hudi+PuppyGraph.mdx:
##########
@@ -0,0 +1,241 @@
+<figure>
+![][image3]

Review Comment:
   TODO


