damccorm commented on code in PR #28734:
URL: https://github.com/apache/beam/pull/28734#discussion_r1342692433
##########
website/www/site/content/en/blog/dyi-content-discovery-platform-genai-beam.md:
##########
@@ -0,0 +1,327 @@

---
layout: post
title: "DIY GenAI Content Discovery Platform with Apache Beam"
date: 2023-09-27 00:00:01 -0800
categories:
  - blog
authors:
  - pabs
  - namitasharma
---
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# DIY GenAI Content Discovery Platform with Apache Beam

Your digital assets, such as documents, PDFs, spreadsheets, and presentations, contain a wealth of valuable information, but sometimes it's hard to find what you're looking for. This blog post explains how to build a DIY starter architecture, based on near real-time ingestion processing and large language models (LLMs), to extract meaningful information from your assets. The model makes the information available and discoverable through a simple natural language query.

Building a near real-time processing pipeline for content ingestion might seem like a complex task, and it can be. To make pipeline building easier, the Apache Beam framework exposes a set of powerful constructs. These constructs take care of complexities such as interacting with multiple types of content sources and destinations and handling errors, while encouraging modular code. They also maintain resiliency and scalability with minimal effort. You can use an Apache Beam streaming pipeline to complete the following tasks (a minimal sketch follows the list):

- Connect to the many components of a solution.
- Quickly process document content ingestion requests.
- Make the information in the documents available a few seconds after ingestion.
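As a minimal sketch in the Beam Python SDK, the entry point of such a streaming pipeline could look like the following. The subscription path, message schema, and chunking logic are assumptions made for illustration; the pipeline described in this post does considerably more (error handling, retries, and content refreshes).

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class ChunkDocument(beam.DoFn):
    """Splits an ingestion request into fixed-size, keyed text chunks."""

    def __init__(self, chunk_size=1000):
        self.chunk_size = chunk_size

    def process(self, message):
        # Hypothetical message schema: {"document_id": ..., "content": ...}.
        doc = json.loads(message.decode("utf-8"))
        text = doc["content"]
        for start in range(0, len(text), self.chunk_size):
            yield {
                "document_id": doc["document_id"],
                "chunk_id": start // self.chunk_size,
                "text": text[start:start + self.chunk_size],
            }


options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadIngestionRequests" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/ingestion")
        | "ChunkContent" >> beam.ParDo(ChunkDocument())
        # Downstream steps would compute an embedding per chunk and write
        # the vectors and text to the platform's storage systems.
        | "DebugPrint" >> beam.Map(print)
    )
```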
LLMs are often used to extract content and summarize information stored in many different places. Organizations can use LLMs to quickly find relevant information disseminated across multiple documents written over the years. The information might be in different formats, or the documents might be too long and complex to read and understand quickly. Use LLMs to process this content to make it easier for people to find the information that they need.

Follow the steps in this guide to create a custom, scalable solution for data extraction, content ingestion, and storage. Learn how to kickstart the development of an LLM-based solution using Google Cloud products and generative AI offerings. Google Cloud is designed to be simple to use, scalable, and flexible, so you can use it as a starting point for further expansion or experimentation.

### High Level Flow

From a high-level perspective, content intake and query interactions are completely separated. An external content owner should be able to send documents (stored in Google Docs or in binary text format) and receive a tracking ID for the ingestion request. The ingestion process then grabs the document's content, creates chunks (configurable in size), and generates embeddings for each chunk. These embeddings represent the content's semantics in the form of 768-dimensional vectors. Given the document identifier (provided at ingestion time) and the chunk identifier, we can store these embeddings in a vector database for later semantic matching. This process is central to later contextualizing user inquiries.

<img class="center-block"
    src="/images/blog/dyi-cdp-genai-beam/cdp-highlevel.png"
    alt="Content Discovery Platform Overview">

The query resolution process does not depend directly on information ingestion. Users should receive relevant answers based on the content ingested up to the moment the query is made, and even when no relevant content is stored in the platform, the platform should return an answer stating exactly that. In general, the query resolution process first generates embeddings from the query content and previously existing context (like previous exchanges with the platform), then matches these embeddings against all of the embedding vectors stored from the content, and, when positive matches are found, retrieves the plain-text content that those embeddings represent. Finally, with the textual representations of both the query and the matched content, the platform formulates a request to the LLM to provide a final answer to the original user inquiry. The matching step is sketched below.
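At its core, the matching step is a nearest-neighbor search over the stored chunk embeddings. In this architecture that search is served by Vertex AI Vector Search, but the idea can be sketched in a few lines of NumPy; every name here is illustrative:

```python
import numpy as np


def top_k_chunks(query_vec, chunk_vecs, chunk_keys, k=5):
    """Returns the (document_id, chunk_id) keys of the k closest chunks.

    query_vec:  query embedding, shape (768,)
    chunk_vecs: stored chunk embeddings, shape (N, 768)
    chunk_keys: N (document_id, chunk_id) pairs, parallel to chunk_vecs
    """
    q = query_vec / np.linalg.norm(query_vec)
    m = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = m @ q  # cosine similarity between the query and every chunk
    best = np.argsort(scores)[::-1][:k]
    return [chunk_keys[i] for i in best]
```

The returned keys are then used to retrieve the chunks' plain text and assemble the context for the LLM request.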
## Components of the solution

The intent is to rely, as much as possible, on the low-ops capabilities of GCP services and to create a set of features that are highly scalable. At a high level, the solution can be separated into two main components: the service layer and the content ingestion pipeline. The service layer acts as the entry point for document ingestion and user queries. It's a simple set of REST resources exposed through Cloud Run and implemented using [Quarkus](https://quarkus.io/) and the client libraries for the other services in use (Vertex AI models, Bigtable, and Pub/Sub). The content ingestion pipeline consists of:

* A streaming pipeline that captures user content from wherever it resides.
* A process that extracts meaning from this content as a set of multi-dimensional vectors (text embeddings).
* A storage system that simplifies context matching between knowledge content and user inquiries (a vector database).
* Another storage system that maps the knowledge representation to the actual content, forming the aggregated context of the inquiry.
* A model capable of understanding the aggregated context and, through prompt engineering, delivering meaningful answers.
* HTTP- and gRPC-based services.

These components work together to provide a comprehensive and simple implementation for a content discovery platform.

## Architecture Design

Given the multiple components in play, this section explains how the different components interact to resolve the two main use cases of the platform.

### Component Dependencies

The following diagram shows all of the components that the platform integrates to capture documents for ingestion and resolve user query requests, as well as the dependencies between the different components of the solution and the GCP services in use.

<img class="center-block"
    src="/images/blog/dyi-cdp-genai-beam/cdp-arch.png"
    alt="Content Discovery Platform Interactions">

As seen in the diagram, the context-extraction component is the central piece, in charge of retrieving the documents' content, obtaining their semantic meaning from the embeddings model, and storing the relevant data (chunk text content, chunk embeddings, JSON-L content) in the persistent storage systems for later use. Pub/Sub resources are the glue between the streaming pipeline and the asynchronous processing: they capture user ingestion requests, retries of potential ingestion errors (such as cases where documents have been sent for ingestion but permission has not been granted yet, triggering a retry after some minutes), and content refresh events (periodically the pipeline scans the ingested documents, reviews the latest editions, and decides whether a content refresh should be triggered).

Also, Cloud Run plays the important role of exposing the services, which interact with the many different storage systems in use to resolve user query requests. For example, capturing the semantic meaning from the user's query by interacting with the embeddings model, finding near matches from Vertex AI Vector Search (formerly Matching Engine), which stores the embedding vectors from the ingested documents' content, and retrieving the text content from Bigtable to contextualize finally the request to the VertexAI Text-Bison and Chat-Bison models for a final response to the user’s originary query.

Review Comment:

   "to contextualize finally the request to the VertexAI Text-Bison and Chat-Bison models for a final response to the user’s originary query."

   I don't follow what this is trying to say - is it just saying "to send the request to the VertexAI..."? I'd suggest rewording.

   Also, I don't think we need to change this, but it is worth calling out that Beam has built in support for calling vertex AI endpoints (e.g. here's an example calling into a tuned vertex text-bison model - https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/inference/vertex_ai_llm_text_classification.py). Maybe we could at least call out that integration point in this blog since I imagine it would be interesting to the audience
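For readers who want to follow up on the integration point raised in the comment, here is a hedged sketch of calling a Vertex AI endpoint from Beam with RunInference, in the style of the linked example. The endpoint ID, project, and instance payload below are placeholders, and the payload must match the instance schema of the deployed model.

```python
import apache_beam as beam
from apache_beam.ml.inference.base import KeyedModelHandler, RunInference
from apache_beam.ml.inference.vertex_ai_inference import VertexAIModelHandlerJSON

# Placeholder identifiers for a deployed Vertex AI endpoint.
model_handler = VertexAIModelHandlerJSON(
    endpoint_id="my-endpoint-id",
    project="my-project",
    location="us-central1")

with beam.Pipeline() as pipeline:
    (
        pipeline
        # Keyed inputs; each dict is the instance sent to the endpoint and
        # is assumed here to look like {"prompt": ...}.
        | beam.Create([("q1", {"prompt": "Summarize the ingested document."})])
        | RunInference(KeyedModelHandler(model_handler))
        | beam.MapTuple(lambda key, result: print(key, result.inference))
    )
```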
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
