damccorm commented on code in PR #28734:
URL: https://github.com/apache/beam/pull/28734#discussion_r1342692433
##########
website/www/site/content/en/blog/dyi-content-discovery-platform-genai-beam.md:
##########
@@ -0,0 +1,327 @@

---
layout: post
title: "DIY GenAI Content Discovery Platform with Apache Beam"
date: 2023-09-27 00:00:01 -0800
categories:
  - blog
authors:
  - pabs
  - namitasharma
---
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# DIY GenAI Content Discovery Platform with Apache Beam

Your digital assets, such as documents, PDFs, spreadsheets, and presentations, contain a wealth of valuable information, but sometimes it's hard to find what you're looking for. This blog post explains how to build a DIY starter architecture, based on near real-time ingestion processing and large language models (LLMs), to extract meaningful information from your assets. The model makes the information available and discoverable through a simple natural language query.

Building a near real-time processing pipeline for content ingestion might seem like a complex task, and it can be. To make pipeline building easier, the Apache Beam framework exposes a set of powerful constructs. These constructs take care of complexities such as interacting with multiple types of content sources and destinations and handling errors, while encouraging modular code. They also maintain resiliency and scalability with minimal effort. You can use an Apache Beam streaming pipeline to complete the following tasks (a minimal sketch follows the list):

- Connect to the many components of a solution.
- Quickly process document content ingestion requests.
- Make the information in the documents available a few seconds after ingestion.
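As a minimal sketch in the Beam Python SDK, the entry point of such a streaming pipeline could look like the following. The subscription path, message schema, and chunking logic are assumptions made for illustration; the pipeline described in this post does considerably more (error handling, retries, and content refreshes).

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class ChunkDocument(beam.DoFn):
    """Splits an ingestion request into fixed-size, keyed text chunks."""

    def __init__(self, chunk_size=1000):
        self.chunk_size = chunk_size

    def process(self, message):
        # Hypothetical message schema: {"document_id": ..., "content": ...}.
        doc = json.loads(message.decode("utf-8"))
        text = doc["content"]
        for start in range(0, len(text), self.chunk_size):
            yield {
                "document_id": doc["document_id"],
                "chunk_id": start // self.chunk_size,
                "text": text[start:start + self.chunk_size],
            }


options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadIngestionRequests" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/ingestion")
        | "ChunkContent" >> beam.ParDo(ChunkDocument())
        # Downstream steps would compute an embedding per chunk and write
        # the vectors and text to the platform's storage systems.
        | "DebugPrint" >> beam.Map(print)
    )
```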
LLMs are often used to extract content and summarize information stored in many different places. Organizations can use LLMs to quickly find relevant information disseminated across multiple documents written over the years. The information might be in different formats, or the documents might be too long and complex to read and understand quickly. Use LLMs to process this content to make it easier for people to find the information that they need.

Follow the steps in this guide to create a custom, scalable solution for data extraction, content ingestion, and storage. Learn how to kickstart the development of an LLM-based solution using Google Cloud products and generative AI offerings. Google Cloud is designed to be simple to use, scalable, and flexible, so you can use it as a starting point for further expansion or experimentation.

### High Level Flow

From a high-level perspective, content intake and query interactions are completely separated. An external content owner should be able to send documents (stored in Google Docs or in binary text format) and receive a tracking ID for the ingestion request. The ingestion process then grabs the document's content, creates chunks (configurable in size), and generates embeddings for each chunk. These embeddings represent the content's semantics in the form of 768-dimensional vectors. Given the document identifier (provided at ingestion time) and the chunk identifier, we can store these embeddings in a vector database for later semantic matching. This process is central to later contextualizing user inquiries.

<img class="center-block"
    src="/images/blog/dyi-cdp-genai-beam/cdp-highlevel.png"
    alt="Content Discovery Platform Overview">

The query resolution process does not depend directly on information ingestion. Users should receive relevant answers based on the content ingested up to the moment the query is made, and even when no relevant content is stored in the platform, the platform should return an answer stating exactly that. In general, the query resolution process first generates embeddings from the query content and previously existing context (like previous exchanges with the platform), then matches these embeddings against all of the embedding vectors stored from the content, and, when positive matches are found, retrieves the plain-text content that those embeddings represent. Finally, with the textual representations of both the query and the matched content, the platform formulates a request to the LLM to provide a final answer to the original user inquiry. The matching step is sketched below.
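At its core, the matching step is a nearest-neighbor search over the stored chunk embeddings. In this architecture that search is served by Vertex AI Vector Search, but the idea can be sketched in a few lines of NumPy; every name here is illustrative:

```python
import numpy as np


def top_k_chunks(query_vec, chunk_vecs, chunk_keys, k=5):
    """Returns the (document_id, chunk_id) keys of the k closest chunks.

    query_vec:  query embedding, shape (768,)
    chunk_vecs: stored chunk embeddings, shape (N, 768)
    chunk_keys: N (document_id, chunk_id) pairs, parallel to chunk_vecs
    """
    q = query_vec / np.linalg.norm(query_vec)
    m = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = m @ q  # cosine similarity between the query and every chunk
    best = np.argsort(scores)[::-1][:k]
    return [chunk_keys[i] for i in best]
```

The returned keys are then used to retrieve the chunks' plain text and assemble the context for the LLM request.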
## Components of the solution

The intent is to rely, as much as possible, on the low-ops capabilities of GCP services and to create a set of features that are highly scalable. At a high level, the solution can be separated into two main components: the service layer and the content ingestion pipeline. The service layer acts as the entry point for document ingestion and user queries. It's a simple set of REST resources exposed through Cloud Run and implemented using [Quarkus](https://quarkus.io/) and the client libraries for the other services in use (Vertex AI models, Bigtable, and Pub/Sub). The content ingestion pipeline consists of:

* A streaming pipeline that captures user content from wherever it resides.
* A process that extracts meaning from this content as a set of multi-dimensional vectors (text embeddings).
* A storage system that simplifies context matching between knowledge content and user inquiries (a vector database).
* Another storage system that maps the knowledge representation to the actual content, forming the aggregated context of the inquiry.
* A model capable of understanding the aggregated context and, through prompt engineering, delivering meaningful answers.
* HTTP- and gRPC-based services.

These components work together to provide a comprehensive and simple implementation for a content discovery platform.

## Architecture Design

Given the multiple components in play, this section explains how the different components interact to resolve the two main use cases of the platform.

### Component Dependencies

The following diagram shows all of the components that the platform integrates to capture documents for ingestion and resolve user query requests, as well as the dependencies between the different components of the solution and the GCP services in use.

<img class="center-block"
    src="/images/blog/dyi-cdp-genai-beam/cdp-arch.png"
    alt="Content Discovery Platform Interactions">

As seen in the diagram, the context-extraction component is the central piece, in charge of retrieving the documents' content, obtaining their semantic meaning from the embeddings model, and storing the relevant data (chunk text content, chunk embeddings, JSON-L content) in the persistent storage systems for later use. Pub/Sub resources are the glue between the streaming pipeline and the asynchronous processing: they capture user ingestion requests, retries of potential ingestion errors (such as cases where documents have been sent for ingestion but permission has not been granted yet, triggering a retry after some minutes), and content refresh events (periodically the pipeline scans the ingested documents, reviews the latest editions, and decides whether a content refresh should be triggered).

Also, Cloud Run plays the important role of exposing the services, which interact with the many different storage systems in use to resolve user query requests. For example, capturing the semantic meaning from the user's query by interacting with the embeddings model, finding near matches from Vertex AI Vector Search (formerly Matching Engine), which stores the embedding vectors from the ingested documents' content, and retrieving the text content from Bigtable to contextualize finally the request to the VertexAI Text-Bison and Chat-Bison models for a final response to the user’s originary query.

Review Comment:

   "to contextualize finally the request to the VertexAI Text-Bison and Chat-Bison models for a final response to the user’s originary query."

   I don't follow what this is trying to say - is it just saying "to send the request to the VertexAI..."? I'd suggest rewording.

   Also, I don't think we need to change this, but it is worth calling out that Beam has built in support for calling vertex AI endpoints (e.g. here's an example calling into a tuned vertex text-bison model - https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/inference/vertex_ai_llm_text_classification.py). Maybe we could at least call out that integration point in this blog since I imagine it would be interesting to the audience
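For readers who want to follow up on the integration point raised in the comment, here is a hedged sketch of calling a Vertex AI endpoint from Beam with RunInference, in the style of the linked example. The endpoint ID, project, and instance payload below are placeholders, and the payload must match the instance schema of the deployed model.

```python
import apache_beam as beam
from apache_beam.ml.inference.base import KeyedModelHandler, RunInference
from apache_beam.ml.inference.vertex_ai_inference import VertexAIModelHandlerJSON

# Placeholder identifiers for a deployed Vertex AI endpoint.
model_handler = VertexAIModelHandlerJSON(
    endpoint_id="my-endpoint-id",
    project="my-project",
    location="us-central1")

with beam.Pipeline() as pipeline:
    (
        pipeline
        # Keyed inputs; each dict is the instance sent to the endpoint and
        # is assumed here to look like {"prompt": ...}.
        | beam.Create([("q1", {"prompt": "Summarize the ingested document."})])
        | RunInference(KeyedModelHandler(model_handler))
        | beam.MapTuple(lambda key, result: print(key, result.inference))
    )
```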
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
