Re: [I] Support parse pdf to structured data (Parser + Normalization). [seatunnel]

via GitHub Mon, 01 Sep 2025 07:13:45 -0700


Hisoka-X commented on issue #9716:
URL: https://github.com/apache/seatunnel/issues/9716#issuecomment-3242536822


   > @Hisoka-X For this task, we have previously implemented a processing 
program for obtaining text through pdf parsing. Then we will perform sharding, 
embedding and writing to the vector database based on the obtained text for 
RAG. Is the process to be implemented this time also similar to this? I want to 
try to accomplish this task.
   
   Yes, this is a feature we also want SeaTunnel to support.
   
   > Also, I would like to know what function Normalization aims to achieve?
   
   Make sure Markdown, PDF, and Word files all return the same CatalogTable. 
Please refer to https://github.com/apache/seatunnel/pull/9760#discussion_r2306274831
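
As a rough illustration of what "return the same CatalogTable" could mean, here is a minimal Python sketch in which two format-specific parsers emit rows under one shared schema. Everything in it is an assumption made for illustration: `UNIFIED_SCHEMA`, `Column`, `normalize`, `parse_markdown`, `parse_pdf_text`, and the column names are hypothetical and are not SeaTunnel's actual CatalogTable API or the design agreed in the linked discussion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Hypothetical column descriptor; not SeaTunnel's real Column class."""
    name: str
    type: str


# Hypothetical stand-in for the single table schema every parser must return.
UNIFIED_SCHEMA = (
    Column("source", "string"),
    Column("element_type", "string"),
    Column("text", "string"),
)


def normalize(source, elements):
    """Map parser-specific (element_type, text) pairs to rows of the shared schema."""
    return [(source, element_type, text.strip()) for element_type, text in elements]


def parse_markdown(source, content):
    """Split Markdown into heading and paragraph elements, one per non-empty line."""
    elements = []
    for line in content.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            elements.append(("heading", line.lstrip("#")))
        else:
            elements.append(("paragraph", line))
    return UNIFIED_SCHEMA, normalize(source, elements)


def parse_pdf_text(source, pages):
    """Treat each non-empty, already-extracted PDF page text as one paragraph."""
    elements = [("paragraph", page) for page in pages if page.strip()]
    return UNIFIED_SCHEMA, normalize(source, elements)
```

With this shape, downstream steps such as chunking and embedding only need to know the shared columns, regardless of which file format produced the rows.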


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

