[
https://issues.apache.org/jira/browse/HIVE-27743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sreenath updated HIVE-27743:
----------------------------
Description:
_Semantic search is the tech that powers *vector databases,* and we can have the
same power in Hive._
Semantic search is a way for computers to understand the meaning behind words
and phrases when you're searching for something. Instead of just looking for
exact matches of keywords, it tries to figure out what you're really asking and
provides results that are more relevant and meaningful to your question. It's
like having a search engine that can understand what you mean, not just what
you say, making it easier to find the information you're looking for. This
ticket is to have Semantic search in Hive as UDFs.
The proposal is to implement functions that calculate, on the fly, the
similarity between two values. Once we have them, semantic search can be
expressed as part of a WHERE clause.
* E.g. (using a cosine similarity function):
“WHERE cos_sim(region, 'europe') > 0.9”. This could return records with regions
like Scandinavia, Nordic, Baltic, etc.
* We could have functions that accept values as text or as vector embeddings.
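To illustrate the WHERE-clause idea, here is a minimal Python sketch (not Hive code). The three-dimensional vectors are hand-picked stand-ins for what an embedding model would produce, and `semantic_filter` is a hypothetical helper that mimics “WHERE cos_sim(region, 'europe') > 0.9”; neither is part of the ticket.

```python
import math

# Toy, hand-picked vectors standing in for real model embeddings (assumption).
TOY_EMBEDDINGS = {
    "europe":      [0.90, 0.10, 0.00],
    "scandinavia": [0.85, 0.20, 0.05],
    "baltic":      [0.80, 0.15, 0.10],
    "sahara":      [0.05, 0.10, 0.95],
}


def cos_sim(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_filter(rows, column, query, threshold=0.9):
    """Keep rows whose `column` value is semantically close to `query`."""
    query_vec = TOY_EMBEDDINGS[query]
    return [
        row for row in rows
        if cos_sim(TOY_EMBEDDINGS[row[column]], query_vec) > threshold
    ]


rows = [
    {"id": 1, "region": "scandinavia"},
    {"id": 2, "region": "sahara"},
    {"id": 3, "region": "baltic"},
]
matches = semantic_filter(rows, "region", "europe")
```

With these toy vectors, the rows for "scandinavia" and "baltic" pass the 0.9 threshold and "sahara" does not. In Hive, the text-to-vector step would presumably be handled by the proposed encode() UDF.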
*On the implementation side, we can have a set of new UDFs and configuration
properties:*
*UDFs:*
# encode(sentences[, prompt, embedding_type, normalize_embeddings])
# cos_sim(a, b)
# dot_score(a, b)
# euclidean_sim(a, b)
# manhattan_sim(a, b)
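The four scoring functions listed above could behave roughly as in the following Python sketch over plain float lists. The ticket does not say how a distance is turned into a similarity; this sketch assumes negated distance for `euclidean_sim` and `manhattan_sim`, so that a larger value always means "more similar". The `normalize` helper is hypothetical and only hints at what a `normalize_embeddings` option might do.

```python
import math


def dot_score(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale v to unit length (hypothetical helper); zero vectors are returned unchanged."""
    norm = math.sqrt(dot_score(v, v))
    return [x / norm for x in v] if norm else list(v)


def cos_sim(a, b):
    """Cosine similarity, computed as the dot product of the normalized vectors."""
    return dot_score(normalize(a), normalize(b))


def euclidean_sim(a, b):
    """Negated Euclidean (L2) distance: 0 for identical vectors (assumed convention)."""
    return -math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_sim(a, b):
    """Negated Manhattan (L1) distance: 0 for identical vectors (assumed convention)."""
    return -sum(abs(x - y) for x, y in zip(a, b))


a = [3.0, 4.0]
b = [4.0, 3.0]
scores = {
    "dot_score": dot_score(a, b),
    "cos_sim": cos_sim(a, b),
    "euclidean_sim": euclidean_sim(a, b),
    "manhattan_sim": manhattan_sim(a, b),
}
```

For a = [3, 4] and b = [4, 3], dot_score is 24 and cos_sim is 0.96. When embeddings are already normalized to unit length, dot_score and cos_sim give the same value, so the cheaper dot product can be used.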
*Configuration properties:*
# hive.embedding.model - Path to a pre-trained SentenceTransformer model
# hive.embedding.batch_size - The batch size used for the computation
# hive.embedding.precision - The precision to use for the embeddings. Can be
“float32”, “int8”, “uint8”, “binary”, or “ubinary”
# hive.embedding.default_prompt - Prompt prefix that must be used by default
# hive.embedding.cache_folder - Path to a local folder to store models
was:
_Semantic search is the tech that powers *vector databases,* and we can have the
same power in Hive._
Semantic search is a way for computers to understand the meaning behind words
and phrases when you're searching for something. Instead of just looking for
exact matches of keywords, it tries to figure out what you're really asking and
provides results that are more relevant and meaningful to your question. It's
like having a search engine that can understand what you mean, not just what
you say, making it easier to find the information you're looking for. This
ticket is a wish to have Semantic search in Hive.
On the implementation side, semantic search uses an embedding model and any of
the similarity distance functions.
My proposal is to implement functions for on-the-fly calculation of similarity
distance between two values. Once we have them we could easily do semantic
search as part of a where clause.
* E.g. (using a cosine similarity function):
“WHERE cos_dist(region, 'europe') > 0.9”. This could return records with regions
like Scandinavia, Nordic, Baltic, etc.
* We could have functions that accept values as text or as vector embeddings.
> Semantic Search In Hive
> -----------------------
>
> Key: HIVE-27743
> URL: https://issues.apache.org/jira/browse/HIVE-27743
> Project: Hive
> Issue Type: Wish
> Environment: *
> Reporter: Sreenath
> Assignee: Sreenath
> Priority: Major