[ https://issues.apache.org/jira/browse/SPARK-44564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ruifeng Zheng updated SPARK-44564:
----------------------------------
    Description:
Let's first focus on the documentation of the *PySpark DataFrame APIs*.

*1*, Choose a subset of DF APIs.
Since review bandwidth is limited, we recommend that each PR contain at least 5 APIs.

*2*, For each API, copy-paste the function (including the function signature and docstring) into an LLM and ask it to refine the documentation with a prompt (e.g. the attached prompt). You can of course use or design your own prompt.
For prompt engineering, you can refer to these [Best practices|https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api].
It is highly recommended to use *GPT-4* instead of GPT-3.5, since the former generates better results.

*3*, Note that the LLM is not 100% reliable; the generated docstring may still contain mistakes, e.g.
* The example code cannot run
* The example results are incorrect
* The example code doesn't reflect the example title
* The description uses the wrong version, or adds a 'Raises' section for a non-existent exception
* The lint can be broken
* ...

We need to fix them before sending a PR.
We can try different prompts, choose the good parts, and combine them into the new docstring.

  was:
Let's first focus on the documentation of the *PySpark DataFrame APIs*.

*1*, Choose a subset of DF APIs.
Since review bandwidth is limited, we recommend that each PR contain at least 5 APIs.

*2*, For each API, copy-paste the function (including the function signature and docstring) into an LLM and ask it to refine the documentation with prompts like:
* please improve the docstring of the 'unionByName' function
* please refine the comments of the 'unionByName' function
* please refine the documents of the 'unionByName' function, and add more examples
* please provide more examples for function 'unionByName'
* ...

It is highly recommended to use *GPT-4* instead of GPT-3.5, since the former generates better results.

*3*, Note that the LLM is not 100% reliable; the generated docstring may contain mistakes, e.g.
* The example code cannot run
* The example results are incorrect
* The example code doesn't reflect the example title
* The description uses the wrong version, or adds a 'Raises' section for a non-existent exception
* The lint can be broken
* ...

We need to fix them before sending a PR.
We can try different prompts, choose the good parts, and combine them into the new docstring.

> Refine the documents with LLM
> -----------------------------
>
>                 Key: SPARK-44564
>                 URL: https://issues.apache.org/jira/browse/SPARK-44564
>             Project: Spark
>          Issue Type: Umbrella
>          Components: Documentation
>    Affects Versions: 4.0.0
>            Reporter: Ruifeng Zheng
>            Priority: Major
>         Attachments: docstr_prompt.py
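
For step *2*, the copy-paste part can be scripted. The following is only a rough sketch, not the attached docstr_prompt.py: `build_prompt` and the stand-in function `union_by_name` are hypothetical names made up for illustration, and the instruction text is just one of the example prompts from the description.

```python
import inspect
import textwrap


def build_prompt(func, instruction):
    """Combine an instruction with a function's signature and docstring."""
    signature = f"def {func.__name__}{inspect.signature(func)}:"
    docstring = inspect.getdoc(func) or ""
    # Re-indent the docstring so it reads like the body of the function.
    quoted = textwrap.indent(f'"""\n{docstring}\n"""', "    ")
    return f"{instruction}\n\n{signature}\n{quoted}\n"


# Hypothetical stand-in for a DataFrame method, so the sketch runs without PySpark.
def union_by_name(self, other, allowMissingColumns=False):
    """Returns a new DataFrame containing the union of rows, matching columns by name."""


prompt = build_prompt(
    union_by_name,
    "Please refine the docstring of the 'union_by_name' function and add more examples.",
)
print(prompt)
```

The resulting text can then be pasted into the LLM of your choice; how the model is actually called is left out on purpose.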
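
For step *3*, two of the listed mistakes (example code that cannot run, and incorrect example results) can be caught mechanically by running the examples of a generated docstring through Python's standard `doctest` module before sending a PR. This is a minimal sketch under assumptions: `check_examples` is a hypothetical helper, and the sample docstrings use plain Python so that they run anywhere. Real PySpark examples would presumably also need a `spark` session supplied through `globs`, which is omitted here.

```python
import doctest


def check_examples(docstring, name="generated_docstring"):
    """Run the doctest examples in a docstring; return (failed, attempted)."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(docstring, {}, name, "<llm-docstring>", 0)
    runner = doctest.DocTestRunner(verbose=False)
    # Discard the failure report here; only the counts are of interest.
    results = runner.run(test, out=lambda text: None)
    return results.failed, results.attempted


good = """
>>> sorted([3, 1, 2])
[1, 2, 3]
"""

bad = """
>>> sum([1, 2, 3])
7
>>> undefined_name + 1
2
"""

print(check_examples(good))  # (0, 1): one example, no failures
print(check_examples(bad))   # (2, 2): wrong result, then a NameError
```

A non-zero failure count only flags the docstring for a closer look. The remaining items on the list (title mismatch, wrong version, non-existent exceptions, lint) still need a manual review.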