[ 
https://issues.apache.org/jira/browse/TIKA-2720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16601694#comment-16601694
 ] 

ASF GitHub Bot commented on TIKA-2720:
--------------------------------------

ThejanW commented on issue #248: Fix for TIKA-2720 [WIP]
URL: https://github.com/apache/tika/pull/248#issuecomment-417965702
 
 
   The sentences in the above comment are passed through the encoder, which 
outputs an array of 512 floats for each sentence. Once I have those arrays, I 
calculate the cosine similarity between every pair of them; here are the 
best-matching sentence pairs with their cosine similarities, highest first. 
   Each segment lists the two sentences followed by their cosine similarity. 
For example, the first segment has the sentences "How old are you?" and 
"What is your age?" with a cosine similarity of 0.8516871929168701, which is 
the highest; the list continues in descending order.
   
   ```
   How old are you? 
   What is your age?
   0.8516871929168701
   
   How old are you?
   How old did you turn?
   0.7483202219009399
   
   What is your age?
   How old did you turn?
   0.6784225106239319
   
   Heavy rain slammed the mid-Atlantic United States on Monday, delaying 
flights, forming sinkholes
   Recently a lot of hurricanes have hit the US
   0.6395097374916077
   
   The Samsung Galaxy S10 has the potential to be the most exciting phone of 
2019
   Android beats iOS in smartphone loyalty, study finds
   0.6229119300842285
   
   Heavy rain slammed the mid-Atlantic United States on Monday, delaying 
flights, forming sinkholes
   News showed, violent floodwaters surging down main Streets
   0.6069092154502869
   
   How old are you?
   When is your birthday?
   0.5812650322914124
   
   What is your age?
   When is your birthday?
   0.5723845362663269
   
   Android beats iOS in smartphone loyalty, study finds
   Apple became the world’s first trillion-dollar public company
   0.5713004469871521
   
   Green tea contains bioactive compounds that improve health
   Is paleo better than keto?
   0.5498321652412415
   
   News showed, violent floodwaters surging down main Streets
   Recently a lot of hurricanes have hit the US
   0.534430205821991
   
   The Samsung Galaxy S10 has the potential to be the most exciting phone of 
2019
   IPhone X includes a 5.8-inch edge-to-edge display which covers the entire 
front of the phone.
   0.5117762088775635
   
   Heavy rain slammed the mid-Atlantic United States on Monday, delaying 
flights, forming sinkholes
   Multiple lines of scientific evidence show that the climate system is warming
   0.5018186569213867
   
   Android beats iOS in smartphone loyalty, study finds
   IPhone X includes a 5.8-inch edge-to-edge display which covers the entire 
front of the phone.
   0.4970431923866272
   
   Green tea contains bioactive compounds that improve health
   Yoga has been shown to help people reduce anxiety
   0.4776824116706848
   
   How old did you turn?
   When is your birthday?
   0.46567195653915405
   
   The Samsung Galaxy S10 has the potential to be the most exciting phone of 
2019
   Apple became the world’s first trillion-dollar public company
   0.4522799849510193
   
   Recently a lot of hurricanes have hit the US
   Multiple lines of scientific evidence show that the climate system is warming
   0.4517837166786194
   
   With roads covered with slippery snow and ice, can challenge even the most 
experienced driver.
   Heavy rain slammed the mid-Atlantic United States on Monday, delaying 
flights, forming sinkholes
   0.42890870571136475
   
   An ounce of prevention is worth a pound of cure
   Green tea contains bioactive compounds that improve health
   0.38761529326438904
   
   An ounce of prevention is worth a pound of cure
   Yoga has been shown to help people reduce anxiety
   0.38396507501602173
   
   News showed, violent floodwaters surging down main Streets
   Multiple lines of scientific evidence show that the climate system is warming
   0.3623693287372589
   
   IPhone X includes a 5.8-inch edge-to-edge display which covers the entire 
front of the phone.
   Apple became the world’s first trillion-dollar public company
   0.361715167760849
   
   With roads covered with slippery snow and ice, can challenge even the most 
experienced driver.
   News showed, violent floodwaters surging down main Streets
   0.35203033685684204
   
   Yoga has been shown to help people reduce anxiety
   Is paleo better than keto?
   0.34740278124809265
   ```
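
The pairwise comparison described above could be sketched roughly as follows.
This is a minimal Python illustration, not the code from the pull request:
`cosine_similarity` and `rank_sentence_pairs` are hypothetical helper names,
and the toy 3-dimensional vectors are made-up stand-ins for the 512-float
arrays the encoder would produce.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: treat a zero vector as having no similarity to anything.
        return 0.0
    return dot / (norm_a * norm_b)


def rank_sentence_pairs(embeddings):
    """Score every unordered sentence pair, highest similarity first.

    `embeddings` maps each sentence to its vector (hypothetical input shape).
    """
    scored = [
        (s1, s2, cosine_similarity(v1, v2))
        for (s1, v1), (s2, v2) in combinations(embeddings.items(), 2)
    ]
    return sorted(scored, key=lambda item: item[2], reverse=True)


# Toy vectors for illustration only; real encoder output has 512 floats.
toy = {
    "sentence A": [1.0, 0.0, 0.0],
    "sentence B": [1.0, 1.0, 0.0],
    "sentence C": [0.0, 0.0, 1.0],
}

# Print each pair in the same three-line layout as the listing above.
for first, second, score in rank_sentence_pairs(toy):
    print(first)
    print(second)
    print(score)
    print()
```

With n sentences this scores n(n-1)/2 pairs, which is fine for a handful of
sentences like the ones in this comment.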

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> A parser to output universal sentence encodings to text
> -------------------------------------------------------
>
>                 Key: TIKA-2720
>                 URL: https://issues.apache.org/jira/browse/TIKA-2720
>             Project: Tika
>          Issue Type: New Feature
>          Components: tika-dl
>            Reporter: Thejan Wijesinghe
>            Priority: Major
>             Fix For: 2.0
>
>
> This parser encodes a text into high dimensional vectors that can be used for 
> text classification, semantic similarity, clustering and other natural 
> language tasks. The model is trained and optimized for greater-than-word 
> length text, such as sentences, phrases or short paragraphs. It is trained on 
> a variety of data sources and a variety of tasks with the aim of dynamically 
> accommodating a wide variety of natural language understanding tasks. The 
> input is variable length English text and the output is a 512 dimensional 
> vector.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
