xushiyan commented on code in PR #14040: URL: https://github.com/apache/hudi/pull/14040#discussion_r2399910822
########## website/blog/2025-10-02-Real-Time-Cloud-Security-Graphs-Hudi+PuppyGraph.mdx: ########## @@ -0,0 +1,241 @@ +--- +title: "Real-Time Cloud Security Graphs with Apache Hudi and PuppyGraph" +excerpt: "" +author: Jaz Samantha Ku, in collaboration with Shiyan Xu +category: blog +image: /assets/images/blog/2025-10-02-Real-Time-Cloud-Security-Graphs-Hudi+PuppyGraph/fig-4.png +tags: +- Apache Hudi +- PuppyGraph +- security +--- + +[CrowdStrike’s 2025 Global Threat Report](https://www.crowdstrike.com/en-us/global-threat-report/) puts average eCrime breakout time at 48 minutes, with the fastest at 51 seconds. This means that by the time security teams are even alerted about the potential breach, attackers have already long infiltrated the system. And that’s assuming they even get alerted. Cloud environments generate massive amounts of access logs, configuration changes, alerts, and telemetry. Reviewing these events in isolation rarely surfaces patterns like lateral movement or privilege escalation. + +Security tools such as SIEM, CSPM, and cloud workload protection need relationship-based analysis. It is not only a login attempt or a policy change, but also who acted, which systems were touched, what privileges were active, and what happened next. Event-centric methods struggle to answer those questions at scale. Graph analysis fits better because it captures paths and context across entities. + +To keep up, the data pipeline must support: + +* Continuous upserts with low lag so detections run on the latest state +* Incremental consumption so analytics read only “what changed since T” +* A rewindable timeline so responders can review state during investigations + +With Apache Hudi and PuppyGraph, this becomes straightforward. Hudi tables support fast upserts and incremental processing. PuppyGraph queries relationships in place using openCypher or Gremlin. 
In this blog, we explore how to get started with real-time security graph analytics at scale using the data already stored in your Hudi lakehouse tables. + +## Why Apache Hudi for Cybersecurity Data? + +[Apache Hudi](https://hudi.apache.org/) is an open data lakehouse platform that brings ACID transaction guarantees to data lakes. It enables efficient, record-level updates and deletes on massive datasets, which makes it a strong foundation for storing and analyzing cybersecurity data such as logs, telemetry, and threat intelligence. Its combination of performance, flexibility, and broad ecosystem integration is well-suited for threat detection, forensic investigation, and compliance work. + +Hudi speeds up large-scale security analytics through features that keep tables both current and query-efficient. Hudi writers excel at handling continuous, mutable workloads without requiring costly full rewrites. Hudi’s [multi-modal indexing subsystem](https://hudi.apache.org/docs/metadata), backed by its internal metadata table, offers efficient lookups and data skipping, dramatically accelerating queries that scan massive log sets to isolate suspicious activity. Hudi keeps tables updatable and queryable as they change, with time-travel and incremental reads for point-in-time forensic analysis. + +Even as data volumes grow, operations remain manageable. Hudi tracks every commit on a timeline, enabling powerful time-travel queries for historical investigations. Asynchronous table services like [compaction](https://hudi.apache.org/docs/compaction), [clustering](https://hudi.apache.org/docs/clustering), [cleaning](https://hudi.apache.org/docs/cleaning), and [indexing](https://hudi.apache.org/docs/metadata_indexing) run in the background to maintain peak performance and storage health while minimizing disruption to ingestion pipelines. 
Furthermore, its consistent commit and delete semantics support the creation of reliable audit trails, simplify data retention policies, and help meet privacy requirements. + +<figure> + +<figcaption>The [Apache Hudi Stack](https://hudi.apache.org/docs/hudi_stack)</figcaption> +</figure> + + +Hudi also integrates seamlessly with the tools security teams already use. You can stream data from Apache Kafka or Debezium CDC into Hudi, register tables in Hive Metastore or AWS Glue Catalog, and query them from popular query engines like Apache Spark, Apache Flink, Presto, Trino, or Amazon Athena. PuppyGraph connects to the same Hudi tables and runs openCypher or Gremlin queries directly on them via the user access layer, so you get real-time graph analytics on the lake with no ETL and no data duplication. + +## Why PuppyGraph for Cybersecurity Data? + +[PuppyGraph](https://puppygraph.com/) is the first real-time, zero-ETL graph query engine. It lets data teams query existing relational stores as a single graph and get up and running in under 10 minutes, avoiding the cost, latency, and maintenance of a separate graph database. To understand why this is so important, let’s take a look at the status quo. + +### Traditional Analytics on the Lake + +Security teams already store logs, configs, and alerts in a lakehouse. SQL engines are great for counts, filters, rollups, and point lookups. They struggle when questions depend on relationships. Lateral movement, privilege escalation, and blast radius span many tables and time windows. Each new join adds complexity, pushes latency up, and breaks easily when schemas evolve or events arrive late. You can stitch context with views and pipelines, but it is fragile and slow to adapt. + +### Dedicated Graph Databases + +Graphs make paths and neighborhoods first class. Graph queries let you answer “what connects to what” in a way that makes sense, without the need for confusing data joins. The tradeoff is operations and freshness. 
Most graph databases want their own storage. That means ETL, a second copy, and lag between source and graph. Continuous upserts are heavy because every change can touch nodes, edges, and multiple indexes. Running a separate cluster adds backups, upgrades, sizing, and vendor-specific tuning. During an incident, that overhead shows up as stale data and slower investigations. + +### How PuppyGraph Helps + +PuppyGraph is not a traditional graph database but a graph query engine designed to run directly on top of your existing data infrastructure without costly and complex ETL (Extract, Transform, Load) processes. This "zero-ETL" approach is its core differentiator, allowing you to query relational data in data warehouses, data lakes, and databases as a unified graph model in minutes. + +Instead of migrating data into a specialized store, PuppyGraph connects to sources including [PostgreSQL](https://www.puppygraph.com/blog/postgresql-graph-database), [Apache Iceberg](https://docs.puppygraph.com/connecting/connecting-to-iceberg/?h=ice), [Apache Hudi](https://docs.puppygraph.com/connecting/connecting-to-apache-hudi/), [BigQuery](https://docs.puppygraph.com/connecting/connecting-to-bigquery/?h=bigq), and others, then builds a virtual graph layer over them. Graph models are defined through simple JSON schema files, making it easy to update, version, or switch graph views without touching the underlying data. From there, you can quickly begin exploring your data with graph queries written in Gremlin or openCypher. 
+ +<figure> +![][image2] Review Comment: TODO ########## website/blog/2025-10-02-Real-Time-Cloud-Security-Graphs-Hudi+PuppyGraph.mdx: ########## @@ -0,0 +1,241 @@ +--- +title: "Real-Time Cloud Security Graphs with Apache Hudi and PuppyGraph" +excerpt: "" +author: Jaz Samantha Ku, in collaboration with Shiyan Xu +category: blog +image: /assets/images/blog/2025-10-02-Real-Time-Cloud-Security-Graphs-Hudi+PuppyGraph/fig-4.png +tags: +- Apache Hudi +- PuppyGraph +- security +--- + +[CrowdStrike’s 2025 Global Threat Report](https://www.crowdstrike.com/en-us/global-threat-report/) puts average eCrime breakout time at 48 minutes, with the fastest at 51 seconds. This means that by the time security teams are even alerted about the potential breach, attackers have already long infiltrated the system. And that’s assuming they even get alerted. Cloud environments generate massive amounts of access logs, configuration changes, alerts, and telemetry. Reviewing these events in isolation rarely surfaces patterns like lateral movement or privilege escalation. + +Security tools such as SIEM, CSPM, and cloud workload protection need relationship-based analysis. It is not only a login attempt or a policy change, but also who acted, which systems were touched, what privileges were active, and what happened next. Event-centric methods struggle to answer those questions at scale. Graph analysis fits better because it captures paths and context across entities. + +To keep up, the data pipeline must support: + +* Continuous upserts with low lag so detections run on the latest state +* Incremental consumption so analytics read only “what changed since T” +* A rewindable timeline so responders can review state during investigations + +With Apache Hudi and PuppyGraph, this becomes straightforward. Hudi tables support fast upserts and incremental processing. PuppyGraph queries relationships in place using openCypher or Gremlin. 
In this blog, we explore how to get started with real-time security graph analytics at scale using the data already stored in your Hudi lakehouse tables. + +## Why Apache Hudi for Cybersecurity Data? + +[Apache Hudi](https://hudi.apache.org/) is an open data lakehouse platform that brings ACID transaction guarantees to data lakes. It enables efficient, record-level updates and deletes on massive datasets, which makes it a strong foundation for storing and analyzing cybersecurity data such as logs, telemetry, and threat intelligence. Its combination of performance, flexibility, and broad ecosystem integration is well-suited for threat detection, forensic investigation, and compliance work. + +Hudi speeds up large-scale security analytics through features that keep tables both current and query-efficient. Hudi writers excel at handling continuous, mutable workloads without requiring costly full rewrites. Hudi’s [multi-modal indexing subsystem](https://hudi.apache.org/docs/metadata), backed by its internal metadata table, offers efficient lookups and data skipping, dramatically accelerating queries that scan massive log sets to isolate suspicious activity. Hudi keeps tables updatable and queryable as they change, with time-travel and incremental reads for point-in-time forensic analysis. + +Even as data volumes grow, operations remain manageable. Hudi tracks every commit on a timeline, enabling powerful time-travel queries for historical investigations. Asynchronous table services like [compaction](https://hudi.apache.org/docs/compaction), [clustering](https://hudi.apache.org/docs/clustering), [cleaning](https://hudi.apache.org/docs/cleaning), and [indexing](https://hudi.apache.org/docs/metadata_indexing) run in the background to maintain peak performance and storage health while minimizing disruption to ingestion pipelines. 
Furthermore, its consistent commit and delete semantics support the creation of reliable audit trails, simplify data retention policies, and help meet privacy requirements. + +<figure> + +<figcaption>The [Apache Hudi Stack](https://hudi.apache.org/docs/hudi_stack)</figcaption> +</figure> + + +Hudi also integrates seamlessly with the tools security teams already use. You can stream data from Apache Kafka or Debezium CDC into Hudi, register tables in Hive Metastore or AWS Glue Catalog, and query them from popular query engines like Apache Spark, Apache Flink, Presto, Trino, or Amazon Athena. PuppyGraph connects to the same Hudi tables and runs openCypher or Gremlin queries directly on them via the user access layer, so you get real-time graph analytics on the lake with no ETL and no data duplication. + +## Why PuppyGraph for Cybersecurity Data? + +[PuppyGraph](https://puppygraph.com/) is the first real-time, zero-ETL graph query engine. It lets data teams query existing relational stores as a single graph and get up and running in under 10 minutes, avoiding the cost, latency, and maintenance of a separate graph database. To understand why this is so important, let’s take a look at the status quo. + +### Traditional Analytics on the Lake + +Security teams already store logs, configs, and alerts in a lakehouse. SQL engines are great for counts, filters, rollups, and point lookups. They struggle when questions depend on relationships. Lateral movement, privilege escalation, and blast radius span many tables and time windows. Each new join adds complexity, pushes latency up, and breaks easily when schemas evolve or events arrive late. You can stitch context with views and pipelines, but it is fragile and slow to adapt. + +### Dedicated Graph Databases + +Graphs make paths and neighborhoods first class. Graph queries let you answer “what connects to what” in a way that makes sense, without the need for confusing data joins. The tradeoff is operations and freshness. 
Most graph databases want their own storage. That means ETL, a second copy, and lag between source and graph. Continuous upserts are heavy because every change can touch nodes, edges, and multiple indexes. Running a separate cluster adds backups, upgrades, sizing, and vendor-specific tuning. During an incident, that overhead shows up as stale data and slower investigations. + +### How PuppyGraph Helps + +PuppyGraph is not a traditional graph database but a graph query engine designed to run directly on top of your existing data infrastructure without costly and complex ETL (Extract, Transform, Load) processes. This "zero-ETL" approach is its core differentiator, allowing you to query relational data in data warehouses, data lakes, and databases as a unified graph model in minutes. + +Instead of migrating data into a specialized store, PuppyGraph connects to sources including [PostgreSQL](https://www.puppygraph.com/blog/postgresql-graph-database), [Apache Iceberg](https://docs.puppygraph.com/connecting/connecting-to-iceberg/?h=ice), [Apache Hudi](https://docs.puppygraph.com/connecting/connecting-to-apache-hudi/), [BigQuery](https://docs.puppygraph.com/connecting/connecting-to-bigquery/?h=bigq), and others, then builds a virtual graph layer over them. Graph models are defined through simple JSON schema files, making it easy to update, version, or switch graph views without touching the underlying data. From there, you can quickly begin exploring your data with graph queries written in Gremlin or openCypher. + +<figure> +![][image2] +<figcaption>PuppyGraph Supported Data Sources</figcaption> +</figure> + +<figure> +![][image3] +<figcaption>Architecture with Graph Database vs. with PuppyGraph</figcaption> +</figure> + +This approach aligns with the broader shift in modern data stacks to separate compute from storage. 
You keep data where it belongs and scale query power independently, which supports petabyte-level workloads without duplicating data or managing fragile pipelines. + +## Real-World Use Case + +We have shown why cloud security benefits from a relationship-first view of identities, resources, and events. In this demo, we’ll show how easy it is to begin querying your cloud security data as a graph. Apache Hudi keeps those tables current with streaming upserts and an investigation-friendly timeline. PuppyGraph lets you query your existing lake tables as a graph. Together they give you real-time security graph analytics on the data you already store. + +Getting started is straightforward. You will deploy the stack, load security data into Hudi, connect PuppyGraph to your catalog, define a graph view, and run a few queries. All in a matter of minutes. + +<figure> +![][image4] Review Comment: same for below ########## website/blog/2025-10-02-Real-Time-Cloud-Security-Graphs-Hudi+PuppyGraph.mdx: ########## @@ -0,0 +1,241 @@ +--- +title: "Real-Time Cloud Security Graphs with Apache Hudi and PuppyGraph" +excerpt: "" +author: Jaz Samantha Ku, in collaboration with Shiyan Xu +category: blog +image: /assets/images/blog/2025-10-02-Real-Time-Cloud-Security-Graphs-Hudi+PuppyGraph/fig-4.png +tags: +- Apache Hudi +- PuppyGraph +- security +--- + +[CrowdStrike’s 2025 Global Threat Report](https://www.crowdstrike.com/en-us/global-threat-report/) puts average eCrime breakout time at 48 minutes, with the fastest at 51 seconds. This means that by the time security teams are even alerted about the potential breach, attackers have already long infiltrated the system. And that’s assuming they even get alerted. Cloud environments generate massive amounts of access logs, configuration changes, alerts, and telemetry. Reviewing these events in isolation rarely surfaces patterns like lateral movement or privilege escalation. 
+ +Security tools such as SIEM, CSPM, and cloud workload protection need relationship-based analysis. It is not only a login attempt or a policy change, but also who acted, which systems were touched, what privileges were active, and what happened next. Event-centric methods struggle to answer those questions at scale. Graph analysis fits better because it captures paths and context across entities. + +To keep up, the data pipeline must support: + +* Continuous upserts with low lag so detections run on the latest state +* Incremental consumption so analytics read only “what changed since T” +* A rewindable timeline so responders can review state during investigations + +With Apache Hudi and PuppyGraph, this becomes straightforward. Hudi tables support fast upserts and incremental processing. PuppyGraph queries relationships in place using openCypher or Gremlin. In this blog, we explore how to get started with real-time security graph analytics at scale using the data already stored in your Hudi lakehouse tables. + +## Why Apache Hudi for Cybersecurity Data? + +[Apache Hudi](https://hudi.apache.org/) is an open data lakehouse platform that brings ACID transaction guarantees to data lakes. It enables efficient, record-level updates and deletes on massive datasets, which makes it a strong foundation for storing and analyzing cybersecurity data such as logs, telemetry, and threat intelligence. Its combination of performance, flexibility, and broad ecosystem integration is well-suited for threat detection, forensic investigation, and compliance work. + +Hudi speeds up large-scale security analytics through features that keep tables both current and query-efficient. Hudi writers excel at handling continuous, mutable workloads without requiring costly full rewrites. 
Hudi’s [multi-modal indexing subsystem](https://hudi.apache.org/docs/metadata), backed by its internal metadata table, offers efficient lookups and data skipping, dramatically accelerating queries that scan massive log sets to isolate suspicious activity. Hudi keeps tables updatable and queryable as they change, with time-travel and incremental reads for point-in-time forensic analysis. + +Even as data volumes grow, operations remain manageable. Hudi tracks every commit on a timeline, enabling powerful time-travel queries for historical investigations. Asynchronous table services like [compaction](https://hudi.apache.org/docs/compaction), [clustering](https://hudi.apache.org/docs/clustering), [cleaning](https://hudi.apache.org/docs/cleaning), and [indexing](https://hudi.apache.org/docs/metadata_indexing) run in the background to maintain peak performance and storage health while minimizing disruption to ingestion pipelines. Furthermore, its consistent commit and delete semantics support the creation of reliable audit trails, simplify data retention policies, and help meet privacy requirements. + +<figure> + +<figcaption>The [Apache Hudi Stack](https://hudi.apache.org/docs/hudi_stack)</figcaption> +</figure> + + +Hudi also integrates seamlessly with the tools security teams already use. You can stream data from Apache Kafka or Debezium CDC into Hudi, register tables in Hive Metastore or AWS Glue Catalog, and query them from popular query engines like Apache Spark, Apache Flink, Presto, Trino, or Amazon Athena. PuppyGraph connects to the same Hudi tables and runs openCypher or Gremlin queries directly on them via the user access layer, so you get real-time graph analytics on the lake with no ETL and no data duplication. + +## Why PuppyGraph for Cybersecurity Data? + +[PuppyGraph](https://puppygraph.com/) is the first real-time, zero-ETL graph query engine. 
It lets data teams query existing relational stores as a single graph and get up and running in under 10 minutes, avoiding the cost, latency, and maintenance of a separate graph database. To understand why this is so important, let’s take a look at the status quo. + +### Traditional Analytics on the Lake + +Security teams already store logs, configs, and alerts in a lakehouse. SQL engines are great for counts, filters, rollups, and point lookups. They struggle when questions depend on relationships. Lateral movement, privilege escalation, and blast radius span many tables and time windows. Each new join adds complexity, pushes latency up, and breaks easily when schemas evolve or events arrive late. You can stitch context with views and pipelines, but it is fragile and slow to adapt. + +### Dedicated Graph Databases + +Graphs make paths and neighborhoods first class. Graph queries let you answer “what connects to what” in a way that makes sense, without the need for confusing data joins. The tradeoff is operations and freshness. Most graph databases want their own storage. That means ETL, a second copy, and lag between source and graph. Continuous upserts are heavy because every change can touch nodes, edges, and multiple indexes. Running a separate cluster adds backups, upgrades, sizing, and vendor-specific tuning. During an incident, that overhead shows up as stale data and slower investigations. + +### How PuppyGraph Helps + +PuppyGraph is not a traditional graph database but a graph query engine designed to run directly on top of your existing data infrastructure without costly and complex ETL (Extract, Transform, Load) processes. This "zero-ETL" approach is its core differentiator, allowing you to query relational data in data warehouses, data lakes, and databases as a unified graph model in minutes. 
+ +Instead of migrating data into a specialized store, PuppyGraph connects to sources including [PostgreSQL](https://www.puppygraph.com/blog/postgresql-graph-database), [Apache Iceberg](https://docs.puppygraph.com/connecting/connecting-to-iceberg/?h=ice), [Apache Hudi](https://docs.puppygraph.com/connecting/connecting-to-apache-hudi/), [BigQuery](https://docs.puppygraph.com/connecting/connecting-to-bigquery/?h=bigq), and others, then builds a virtual graph layer over them. Graph models are defined through simple JSON schema files, making it easy to update, version, or switch graph views without touching the underlying data. From there, you can quickly begin exploring your data with graph queries written in Gremlin or openCypher. + +<figure> +![][image2] +<figcaption>PuppyGraph Supported Data Sources</figcaption> +</figure> + +<figure> +![][image3] +<figcaption>Architecture with Graph Database vs. with PuppyGraph</figcaption> +</figure> + +This approach aligns with the broader shift in modern data stacks to separate compute from storage. You keep data where it belongs and scale query power independently, which supports petabyte-level workloads without duplicating data or managing fragile pipelines. + +## Real-World Use Case + +We have shown why cloud security benefits from a relationship-first view of identities, resources, and events. In this demo, we’ll show how easy it is to begin querying your cloud security data as a graph. Apache Hudi keeps those tables current with streaming upserts and an investigation-friendly timeline. PuppyGraph lets you query your existing lake tables as a graph. Together they give you real-time security graph analytics on the data you already store. + +Getting started is straightforward. You will deploy the stack, load security data into Hudi, connect PuppyGraph to your catalog, define a graph view, and run a few queries. All in a matter of minutes. 
+ +<figure> +![][image4] Review Comment: TODO ########## website/blog/2025-10-02-Real-Time-Cloud-Security-Graphs-Hudi+PuppyGraph.mdx: ########## @@ -0,0 +1,241 @@ +--- +title: "Real-Time Cloud Security Graphs with Apache Hudi and PuppyGraph" +excerpt: "" +author: Jaz Samantha Ku, in collaboration with Shiyan Xu +category: blog +image: /assets/images/blog/2025-10-02-Real-Time-Cloud-Security-Graphs-Hudi+PuppyGraph/fig-4.png +tags: +- Apache Hudi +- PuppyGraph +- security +--- + +[CrowdStrike’s 2025 Global Threat Report](https://www.crowdstrike.com/en-us/global-threat-report/) puts average eCrime breakout time at 48 minutes, with the fastest at 51 seconds. This means that by the time security teams are even alerted about the potential breach, attackers have already long infiltrated the system. And that’s assuming they even get alerted. Cloud environments generate massive amounts of access logs, configuration changes, alerts, and telemetry. Reviewing these events in isolation rarely surfaces patterns like lateral movement or privilege escalation. + +Security tools such as SIEM, CSPM, and cloud workload protection need relationship-based analysis. It is not only a login attempt or a policy change, but also who acted, which systems were touched, what privileges were active, and what happened next. Event-centric methods struggle to answer those questions at scale. Graph analysis fits better because it captures paths and context across entities. + +To keep up, the data pipeline must support: + +* Continuous upserts with low lag so detections run on the latest state +* Incremental consumption so analytics read only “what changed since T” +* A rewindable timeline so responders can review state during investigations + +With Apache Hudi and PuppyGraph, this becomes straightforward. Hudi tables support fast upserts and incremental processing. PuppyGraph queries relationships in place using openCypher or Gremlin. 
In this blog, we explore how to get started with real-time security graph analytics at scale using the data already stored in your Hudi lakehouse tables. + +## Why Apache Hudi for Cybersecurity Data? + +[Apache Hudi](https://hudi.apache.org/) is an open data lakehouse platform that brings ACID transaction guarantees to data lakes. It enables efficient, record-level updates and deletes on massive datasets, which makes it a strong foundation for storing and analyzing cybersecurity data such as logs, telemetry, and threat intelligence. Its combination of performance, flexibility, and broad ecosystem integration is well-suited for threat detection, forensic investigation, and compliance work. + +Hudi speeds up large-scale security analytics through features that keep tables both current and query-efficient. Hudi writers excel at handling continuous, mutable workloads without requiring costly full rewrites. Hudi’s [multi-modal indexing subsystem](https://hudi.apache.org/docs/metadata), backed by its internal metadata table, offers efficient lookups and data skipping, dramatically accelerating queries that scan massive log sets to isolate suspicious activity. Hudi keeps tables updatable and queryable as they change, with time-travel and incremental reads for point-in-time forensic analysis. + +Even as data volumes grow, operations remain manageable. Hudi tracks every commit on a timeline, enabling powerful time-travel queries for historical investigations. Asynchronous table services like [compaction](https://hudi.apache.org/docs/compaction), [clustering](https://hudi.apache.org/docs/clustering), [cleaning](https://hudi.apache.org/docs/cleaning), and [indexing](https://hudi.apache.org/docs/metadata_indexing) run in the background to maintain peak performance and storage health while minimizing disruption to ingestion pipelines. 
Furthermore, its consistent commit and delete semantics support the creation of reliable audit trails, simplify data retention policies, and help meet privacy requirements. + +<figure> + +<figcaption>The [Apache Hudi Stack](https://hudi.apache.org/docs/hudi_stack)</figcaption> +</figure> + + +Hudi also integrates seamlessly with the tools security teams already use. You can stream data from Apache Kafka or Debezium CDC into Hudi, register tables in Hive Metastore or AWS Glue Catalog, and query them from popular query engines like Apache Spark, Apache Flink, Presto, Trino, or Amazon Athena. PuppyGraph connects to the same Hudi tables and runs openCypher or Gremlin queries directly on them via the user access layer, so you get real-time graph analytics on the lake with no ETL and no data duplication. + +## Why PuppyGraph for Cybersecurity Data? + +[PuppyGraph](https://puppygraph.com/) is the first real-time, zero-ETL graph query engine. It lets data teams query existing relational stores as a single graph and get up and running in under 10 minutes, avoiding the cost, latency, and maintenance of a separate graph database. To understand why this is so important, let’s take a look at the status quo. + +### Traditional Analytics on the Lake + +Security teams already store logs, configs, and alerts in a lakehouse. SQL engines are great for counts, filters, rollups, and point lookups. They struggle when questions depend on relationships. Lateral movement, privilege escalation, and blast radius span many tables and time windows. Each new join adds complexity, pushes latency up, and breaks easily when schemas evolve or events arrive late. You can stitch context with views and pipelines, but it is fragile and slow to adapt. + +### Dedicated Graph Databases + +Graphs make paths and neighborhoods first class. Graph queries let you answer “what connects to what” in a way that makes sense, without the need for confusing data joins. The tradeoff is operations and freshness. 
Most graph databases want their own storage. That means ETL, a second copy, and lag between source and graph. Continuous upserts are heavy because every change can touch nodes, edges, and multiple indexes. Running a separate cluster adds backups, upgrades, sizing, and vendor-specific tuning. During an incident, that overhead shows up as stale data and slower investigations. + +### How PuppyGraph Helps + +PuppyGraph is not a traditional graph database but a graph query engine designed to run directly on top of your existing data infrastructure without costly and complex ETL (Extract, Transform, Load) processes. This "zero-ETL" approach is its core differentiator, allowing you to query relational data in data warehouses, data lakes, and databases as a unified graph model in minutes. + +Instead of migrating data into a specialized store, PuppyGraph connects to sources including [PostgreSQL](https://www.puppygraph.com/blog/postgresql-graph-database), [Apache Iceberg](https://docs.puppygraph.com/connecting/connecting-to-iceberg/?h=ice), [Apache Hudi](https://docs.puppygraph.com/connecting/connecting-to-apache-hudi/), [BigQuery](https://docs.puppygraph.com/connecting/connecting-to-bigquery/?h=bigq), and others, then builds a virtual graph layer over them. Graph models are defined through simple JSON schema files, making it easy to update, version, or switch graph views without touching the underlying data. From there, you can quickly begin exploring your data with graph queries written in Gremlin or openCypher. + +<figure> +![][image2] +<figcaption>PuppyGraph Supported Data Sources</figcaption> +</figure> + +<figure> +![][image3] Review Comment: TODO
